Message-ID: <CANn89iKVQ=c8zxm0MqR7ycR1RFbKqObEPEJrpWCfxH4MdVf3Og@mail.gmail.com>
Date: Thu, 28 Aug 2025 12:43:36 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: dima@...sta.com
Cc: Neal Cardwell <ncardwell@...gle.com>, Kuniyuki Iwashima <kuniyu@...gle.com>,
"David S. Miller" <davem@...emloft.net>, David Ahern <dsahern@...nel.org>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Bob Gilligan <gilligan@...sta.com>, Salam Noureddine <noureddine@...sta.com>,
Dmitry Safonov <0x7f454c46@...il.com>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next v2 1/2] tcp: Destroy TCP-AO, TCP-MD5 keys in .sk_destruct()
On Thu, Aug 28, 2025 at 1:15 AM Dmitry Safonov via B4 Relay
<devnull+dima.arista.com@...nel.org> wrote:
>
> From: Dmitry Safonov <dima@...sta.com>
>
> Currently there are a couple of minor issues with destroying the keys
> in tcp_v4_destroy_sock():
>
> 1. The socket is still in the TCP bind buckets, making it reachable for
> incoming segments [on another CPU core], potentially available to send
> late FIN/ACK/RST replies.
>
> 2. There is at least one code path where tcp_done() is called before
> sending the RST [kudos to Bob for investigation]. This is the case of
> a server that finished sending its data and just called close().
>
> The socket is in TCP_FIN_WAIT2 and has RCV_SHUTDOWN (set by
> __tcp_close())
>
> tcp_v4_do_rcv()/tcp_v6_do_rcv()
>   tcp_rcv_state_process()            /* LINUX_MIB_TCPABORTONDATA */
>     tcp_reset()
>       tcp_done_with_error()
>         tcp_done()
>           inet_csk_destroy_sock()    /* Destroys AO/MD5 keys */
>   /* tcp_rcv_state_process() returns SKB_DROP_REASON_TCP_ABORT_ON_DATA */
>   tcp_v4_send_reset()                /* Sends an unsigned RST segment */
>
> tcpdump:
> > 22:53:15.399377 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 33929, offset 0, flags [DF], proto TCP (6), length 60)
> > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [F.], seq 2185658590, ack 3969644355, win 502, options [nop,nop,md5 valid], length 0
> > 22:53:15.399396 00:00:01:01:00:00 > 00:00:b2:1f:00:00, ethertype IPv4 (0x0800), length 86: (tos 0x0, ttl 64, id 51951, offset 0, flags [DF], proto TCP (6), length 72)
> > 1.0.0.2.49848 > 1.0.0.1.34567: Flags [.], seq 3969644375, ack 2185658591, win 128, options [nop,nop,md5 valid,nop,nop,sack 1 {2185658590:2185658591}], length 0
> > 22:53:16.429588 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
> > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658590, win 0, length 0
> > 22:53:16.664725 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
> > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0
> > 22:53:17.289832 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
> > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0
>
> Note the signed RSTs later in the dump - those are sent by the server's
> listener socket once the fin-wait socket has been removed from the
> hash buckets.
>
> Instead of destroying AO/MD5 info and their keys in inet_csk_destroy_sock(),
> slightly delay it until the actual socket .sk_destruct(). As a shut-down
> socket can still send non-data replies, they should be signed in order for
> the peer to process them. Now it also matches how AO/MD5 gets destructed
> for TIME-WAIT sockets (in tcp_twsk_destructor()).
>
> This seems optimal for TCP-MD5, while for TCP-AO it seems to have an
> open problem: once the RST gets sent and the socket actually gets destructed,
> there is no information on the initial sequence numbers. So, in case
> this last RST gets lost in the network, the server's listener socket
> won't be able to properly sign another RST. Nothing in RFC 1122
> prescribes keeping any local state after non-graceful reset.
> Luckily, BGP is known to use keepalives.
>
> While the issue is quite minor/cosmetic, these days monitoring network
> counters is a common practice and getting invalid signed segments from
> a trusted BGP peer can get customers worried.
>
> Investigated-by: Bob Gilligan <gilligan@...sta.com>
> Signed-off-by: Dmitry Safonov <dima@...sta.com>
> ---
> include/net/tcp.h | 4 ++++
> net/ipv4/tcp.c | 27 +++++++++++++++++++++++++++
> net/ipv4/tcp_ipv4.c | 33 ++++++++-------------------------
> net/ipv6/tcp_ipv6.c | 8 ++++++++
> 4 files changed, 47 insertions(+), 25 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 2936b8175950faa777f81f3c6b7230bcc375d772..0009c26241964b54aa93bc1b86158050d96c2c98 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1931,6 +1931,7 @@ tcp_md5_do_lookup_any_l3index(const struct sock *sk,
> }
>
> #define tcp_twsk_md5_key(twsk) ((twsk)->tw_md5_key)
> +void tcp_md5_destruct_sock(struct sock *sk);
> #else
> static inline struct tcp_md5sig_key *
> tcp_md5_do_lookup(const struct sock *sk, int l3index,
> @@ -1947,6 +1948,9 @@ tcp_md5_do_lookup_any_l3index(const struct sock *sk,
> }
>
> #define tcp_twsk_md5_key(twsk) NULL
> +static inline void tcp_md5_destruct_sock(struct sock *sk)
> +{
> +}
> #endif
>
> int tcp_md5_alloc_sigpool(void);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 9bc8317e92b7952871f07ae11a9c2eaa7d3a9e65..927233ee7500e0568782ae4a3860af56d1476acd 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -412,6 +412,33 @@ static u64 tcp_compute_delivery_rate(const struct tcp_sock *tp)
> 	return rate64;
> }
>
> +#ifdef CONFIG_TCP_MD5SIG
> +static void tcp_md5sig_info_free_rcu(struct rcu_head *head)
> +{
> +	struct tcp_md5sig_info *md5sig;
> +
> +	md5sig = container_of(head, struct tcp_md5sig_info, rcu);
> +	kfree(md5sig);
> +	static_branch_slow_dec_deferred(&tcp_md5_needed);
> +	tcp_md5_release_sigpool();
> +}
> +
> +void tcp_md5_destruct_sock(struct sock *sk)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +
> +	if (tp->md5sig_info) {
> +		struct tcp_md5sig_info *md5sig;
> +
> +		md5sig = rcu_dereference_protected(tp->md5sig_info, 1);
> +		tcp_clear_md5_list(sk);
> +		call_rcu(&md5sig->rcu, tcp_md5sig_info_free_rcu);
> +		rcu_assign_pointer(tp->md5sig_info, NULL);

I would move this line before call_rcu(&md5sig->rcu, tcp_md5sig_info_free_rcu);
otherwise the free could happen before the pointer is cleared, and a UAF could occur.

It is not entirely clear whether this function runs under rcu_read_lock(),
and even if it is currently safe, this could change in the future.
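
A minimal userspace sketch of the ordering suggested above; it is not kernel code. Every name in it (struct deferred, defer_free(), run_deferred(), fake_md5_destruct_sock(), and so on) is a hypothetical stand-in: defer_free() only models call_rcu() by queueing a callback, and run_deferred() models the end of a grace period. The point it illustrates is that the pointer is unpublished before the free is scheduled, so nothing can reach the object through the socket once the free is pending.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rcu_head; not a kernel type. */
struct deferred {
	struct deferred *next;
	void (*func)(struct deferred *);
};

/* Hypothetical stand-in for struct tcp_md5sig_info. */
struct md5_info {
	struct deferred rcu;	/* first member, so a cast recovers the object */
	int nkeys;
};

/* Hypothetical stand-in for the socket that publishes the pointer. */
struct fake_sock {
	struct md5_info *md5sig_info;
};

static struct deferred *pending;	/* callbacks awaiting a "grace period" */
static int freed_count;

/* Models call_rcu(): queue the callback, run it later. */
static void defer_free(struct deferred *d, void (*func)(struct deferred *))
{
	d->func = func;
	d->next = pending;
	pending = d;
}

/* Models the end of a grace period: run every queued callback. */
static void run_deferred(void)
{
	while (pending) {
		struct deferred *d = pending;

		pending = d->next;
		d->func(d);
	}
}

static void md5_info_free(struct deferred *d)
{
	free((struct md5_info *)d);
	freed_count++;
}

static void fake_md5_destruct_sock(struct fake_sock *sk)
{
	struct md5_info *info = sk->md5sig_info;

	if (!info)
		return;

	info->nkeys = 0;			/* stands in for tcp_clear_md5_list() */
	sk->md5sig_info = NULL;			/* unpublish the pointer first ... */
	defer_free(&info->rcu, md5_info_free);	/* ... then schedule the free */
}
```

With this ordering, the socket's pointer is already NULL at the moment the callback is queued, so a later run of the callback cannot free memory that is still reachable through the socket.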

Other than that:

Reviewed-by: Eric Dumazet <edumazet@...gle.com>