linux-kernel - Re: [PATCH] tcp: check socket state before calling WARN

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z8KcXQhdRId1S6w8@perf>
Date: Sat, 1 Mar 2025 14:37:23 +0900
From: Youngmin Nam <youngmin.nam@...sung.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
	davem@...emloft.net, dsahern@...nel.org, pabeni@...hat.com,
	horms@...nel.org, guo88.liu@...sung.com, yiwang.cai@...sung.com,
	netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
	joonki.min@...sung.com, hajun.sung@...sung.com, d7271.choe@...sung.com,
	sw.ju@...sung.com, "Dujeong.lee" <dujeong.lee@...sung.com>, Yuchung Cheng
	<ycheng@...gle.com>, Kevin Yang <yyd@...gle.com>, Xueming Feng
	<kuro@...oa.me>, Youngmin Nam <youngmin.nam@...sung.com>,
	cmllamas@...gle.com, willdeacon@...gle.com, maennich@...gle.com
Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON

On Tue, Feb 25, 2025 at 12:24:47PM -0500, Neal Cardwell wrote:
> On Mon, Feb 24, 2025 at 4:13 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> >
> > On Mon, Feb 3, 2025 at 12:17 AM Youngmin Nam <youngmin.nam@...sung.com> wrote:
> > >
> > > > Hi Neal,
> > > > Thank you for looking into this issue.
> > > > When we first encountered this issue, we also suspected that tcp_write_queue_purge() was being called.
> > > > We can provide any information you would like to inspect.
> >
> > Thanks again for raising this issue, and providing all that data!
> >
> > I've come up with a reproducer for this issue, and an explanation for
> > why this has only been seen on Android so far, and a theory about a
> > related socket leak issue, and a proposed fix for the WARN and the
> > socket leak.
> >
> > Here is the scenario:
> >
> > + user process A has a socket in TCP_ESTABLISHED
> >
> > + user process A calls close(fd)
> >
> > + socket calls __tcp_close() and tcp_close_state() decides to enter
> > TCP_FIN_WAIT1 and send a FIN
> >
> > + FIN is lost and retransmitted, making the state:
> > ---
> >  tp->packets_out = 1
> >  tp->sacked_out = 0
> >  tp->lost_out = 1
> >  tp->retrans_out = 1
> > ---
> >
> > + someone invokes "ss" to --kill the socket using the functionality in
> > (1e64e298b8 "net: diag: Support destroying TCP sockets")
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1e64e298b8cad309091b95d8436a0255c84f54a
> >
> >  (note: this was added for Android, so would not be surprising to have
> > this inet_diag --kill run on Android)
> >
> > + the ss --kill causes a call to tcp_abort()
> >
> > + tcp_abort() calls tcp_write_queue_purge()
> >
> > + tcp_write_queue_purge() sets packets_out=0 but leaves lost_out=1,
> > retrans_out=1
> >
> > + tcp_sock still exists in TCP_FIN_WAIT1 but now with an inconsistent state
> >
> > + ACK arrives and causes a WARN_ON from tcp_verify_left_out():
> >
> > #define tcp_verify_left_out(tp) WARN_ON(tcp_left_out(tp) > tp->packets_out)
> >
> > because the state has:
> >
> >  ---
> >  tcp_left_out(tp) = sacked_out + lost_out = 1
> >   tp->packets_out = 0
> > ---
> >
> > because the state is:
> >
> > ---
> >  tp->packets_out = 0
> >  tp->sacked_out = 0
> >  tp->lost_out = 1
> >  tp->retrans_out = 1
> > ---
> >
> > I guess perhaps one fix would be to just have tcp_write_queue_purge()
> > zero out those other fields:
> >
> > ---
> >  tp->sacked_out = 0
> >  tp->lost_out = 0
> >  tp->retrans_out = 0
> > ---
> >
> > However, there is a related and worse problem. Because this killed
> > socket has tp->packets_out, the next time the RTO timer fires,
> > tcp_retransmit_timer() notices !tp->packets_out is true, so it short
> > circuits and returns without setting another RTO timer or checking to
> > see if the socket should be deleted. So the tcp_sock is now sitting in
> > memory with no timer set to delete it. So we could leak a socket this
> > way. So AFAICT to fix this socket leak problem, perhaps we want a
> > patch like the following (not tested yet), so that we delete all
> > killed sockets immediately, whether they are SOCK_DEAD (orphans for
> > which the user already called close() or not) :
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 28cf19317b6c2..a266078b8ec8c 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -5563,15 +5563,12 @@ int tcp_abort(struct sock *sk, int err)
> >         local_bh_disable();
> >         bh_lock_sock(sk);
> >
> > -       if (!sock_flag(sk, SOCK_DEAD)) {
> > -               if (tcp_need_reset(sk->sk_state))
> > -                       tcp_send_active_reset(sk, GFP_ATOMIC);
> > -               tcp_done_with_error(sk, err);
> > -       }
> > +       if (tcp_need_reset(sk->sk_state))
> > +               tcp_send_active_reset(sk, GFP_ATOMIC);
> > +       tcp_done_with_error(sk, err);
> >
> >         bh_unlock_sock(sk);
> >         local_bh_enable();
> > -       tcp_write_queue_purge(sk);
> >         release_sock(sk);
> >         return 0;
> >  }
> > ---
> 
> Actually, it seems like a similar fix was already merged into Linux v6.11:
> 
> bac76cf89816b tcp: fix forever orphan socket caused by tcp_abort
> 
> Details below.
> 
> Youngmin, does your kernel have this bac76cf89816b fix? If not, can
> you please cherry-pick this fix and retest?
> 
> Thanks!
> neal

Hi Neal.

Thank you for your effort in debugging this issue with me.
I also appreciate your detailed explanation and for finding the patch related to the issue.

Our kernel(an Android kernel based on 6.6 LTS) does not have the patch you mentioned.(bac76cf89816b)

I'll let you know the test results after applying the patch.

Thank you.

> 
> ps: details for bac76cf89816b:
> 
> commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4
> Author: Xueming Feng <kuro@...oa.me>
> Date:   Mon Aug 26 18:23:27 2024 +0800
> 
>     tcp: fix forever orphan socket caused by tcp_abort
> 
>     We have some problem closing zero-window fin-wait-1 tcp sockets in our
>     environment. This patch come from the investigation.
> 
>     Previously tcp_abort only sends out reset and calls tcp_done when the
>     socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
>     purging the write queue, but not close the socket and left it to the
>     timer.
> 
>     While purging the write queue, tp->packets_out and sk->sk_write_queue
>     is cleared along the way. However tcp_retransmit_timer have early
>     return based on !tp->packets_out and tcp_probe_timer have early
>     return based on !sk->sk_write_queue.
> 
>     This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
>     and socket not being killed by the timers, converting a zero-windowed
>     orphan into a forever orphan.
> 
>     This patch removes the SOCK_DEAD check in tcp_abort, making it send
>     reset to peer and close the socket accordingly. Preventing the
>     timer-less orphan from happening.
> 
>     According to Lorenzo's email in the v1 thread, the check was there to
>     prevent force-closing the same socket twice. That situation is handled
>     by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
>     already closed.
> 
>     The -ENOENT code comes from the associate patch Lorenzo made for
>     iproute2-ss; link attached below, which also conform to RFC 9293.
> 
>     At the end of the patch, tcp_write_queue_purge(sk) is removed because it
>     was already called in tcp_done_with_error().
> 
>     p.s. This is the same patch with v2. Resent due to mis-labeled "changes
>     requested" on patchwork.kernel.org.
> 
>     Link: https://protect2.fireeye.com/v1/url?k=544c1d82-0bd7257f-544d96cd-000babff317b-bdb81ce9ab3ea266&q=1&e=a6f04ac5-af96-4431-b73d-76d141ecd941&u=https%3A%2F%2Fpatchwork.ozlabs.org%2Fproject%2Fnetdev%2Fpatch%2F1450773094-7978-3-git-send-email-lorenzo%40google.com%2F
>     Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
>     Signed-off-by: Xueming Feng <kuro@...oa.me>
>     Tested-by: Lorenzo Colitti <lorenzo@...gle.com>
>     Reviewed-by: Jason Xing <kerneljasonxing@...il.com>
>     Reviewed-by: Eric Dumazet <edumazet@...gle.com>
>     Link: https://protect2.fireeye.com/v1/url?k=4a9f6303-15045bfe-4a9ee84c-000babff317b-4ccbbea72f6265df&q=1&e=a6f04ac5-af96-4431-b73d-76d141ecd941&u=https%3A%2F%2Fpatch.msgid.link%2F20240826102327.1461482-1-kuro%40kuroa.me
>     Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e03a342c9162b..831a18dc7aa6d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
>                 /* Don't race with userspace socket closes such as tcp_close. */
>                 lock_sock(sk);
> 
> +       /* Avoid closing the same socket twice. */
> +       if (sk->sk_state == TCP_CLOSE) {
> +               if (!has_current_bpf_ctx())
> +                       release_sock(sk);
> +               return -ENOENT;
> +       }
> +
>         if (sk->sk_state == TCP_LISTEN) {
>                 tcp_set_state(sk, TCP_CLOSE);
>                 inet_csk_listen_stop(sk);
> @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
>         local_bh_disable();
>         bh_lock_sock(sk);
> 
> -       if (!sock_flag(sk, SOCK_DEAD)) {
> -               if (tcp_need_reset(sk->sk_state))
> -                       tcp_send_active_reset(sk, GFP_ATOMIC,
> -                                             SK_RST_REASON_NOT_SPECIFIED);
> -               tcp_done_with_error(sk, err);
> -       }
> +       if (tcp_need_reset(sk->sk_state))
> +               tcp_send_active_reset(sk, GFP_ATOMIC,
> +                                     SK_RST_REASON_NOT_SPECIFIED);
> +       tcp_done_with_error(sk, err);
> 
>         bh_unlock_sock(sk);
>         local_bh_enable();
> -       tcp_write_queue_purge(sk);
>         if (!has_current_bpf_ctx())
>                 release_sock(sk);
>         return 0;
>