linux-kernel - Re: [PATCH] tcp: check socket state before calling WARN

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAK6E8=cB5LW8EtP5Hs6Xb8HP7Hr7TzHSHx--c7yj50RhUraUEQ@mail.gmail.com>
Date: Tue, 25 Feb 2025 10:28:06 -0800
From: Yuchung Cheng <ycheng@...gle.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: Youngmin Nam <youngmin.nam@...sung.com>, Eric Dumazet <edumazet@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, davem@...emloft.net, dsahern@...nel.org, 
	pabeni@...hat.com, horms@...nel.org, guo88.liu@...sung.com, 
	yiwang.cai@...sung.com, netdev@...r.kernel.org, linux-kernel@...r.kernel.org, 
	joonki.min@...sung.com, hajun.sung@...sung.com, d7271.choe@...sung.com, 
	sw.ju@...sung.com, "Dujeong.lee" <dujeong.lee@...sung.com>, Kevin Yang <yyd@...gle.com>, 
	Xueming Feng <kuro@...oa.me>
Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON

On Tue, Feb 25, 2025 at 9:25 AM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Mon, Feb 24, 2025 at 4:13 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> >
> > On Mon, Feb 3, 2025 at 12:17 AM Youngmin Nam <youngmin.nam@...sung.com> wrote:
> > >
> > > > Hi Neal,
> > > > Thank you for looking into this issue.
> > > > When we first encountered this issue, we also suspected that tcp_write_queue_purge() was being called.
> > > > We can provide any information you would like to inspect.
> >
> > Thanks again for raising this issue, and providing all that data!
> >
> > I've come up with a reproducer for this issue, and an explanation for
> > why this has only been seen on Android so far, and a theory about a
> > related socket leak issue, and a proposed fix for the WARN and the
> > socket leak.
> >
> > Here is the scenario:
> >
> > + user process A has a socket in TCP_ESTABLISHED
> >
> > + user process A calls close(fd)
> >
> > + socket calls __tcp_close() and tcp_close_state() decides to enter
> > TCP_FIN_WAIT1 and send a FIN
> >
> > + FIN is lost and retransmitted, making the state:
> > ---
> >  tp->packets_out = 1
> >  tp->sacked_out = 0
> >  tp->lost_out = 1
> >  tp->retrans_out = 1
> > ---
> >
> > + someone invokes "ss" to --kill the socket using the functionality in
> > (1e64e298b8 "net: diag: Support destroying TCP sockets")
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1e64e298b8cad309091b95d8436a0255c84f54a
> >
> >  (note: this was added for Android, so would not be surprising to have
> > this inet_diag --kill run on Android)
> >
> > + the ss --kill causes a call to tcp_abort()
> >
> > + tcp_abort() calls tcp_write_queue_purge()
> >
> > + tcp_write_queue_purge() sets packets_out=0 but leaves lost_out=1,
> > retrans_out=1
> >
> > + tcp_sock still exists in TCP_FIN_WAIT1 but now with an inconsistent state
> >
> > + ACK arrives and causes a WARN_ON from tcp_verify_left_out():
> >
> > #define tcp_verify_left_out(tp) WARN_ON(tcp_left_out(tp) > tp->packets_out)
> >
> > because the state has:
> >
> >  ---
> >  tcp_left_out(tp) = sacked_out + lost_out = 1
> >   tp->packets_out = 0
> > ---
> >
> > because the state is:
> >
> > ---
> >  tp->packets_out = 0
> >  tp->sacked_out = 0
> >  tp->lost_out = 1
> >  tp->retrans_out = 1
> > ---
> >
> > I guess perhaps one fix would be to just have tcp_write_queue_purge()
> > zero out those other fields:
> >
> > ---
> >  tp->sacked_out = 0
> >  tp->lost_out = 0
> >  tp->retrans_out = 0
> > ---
> >
> > However, there is a related and worse problem. Because this killed
> > socket has tp->packets_out, the next time the RTO timer fires,
Zeroing all inflights stats in tcp_write_queue_purge still makes sense
to me. Why will the RTO timer still fire if packets_out is zeroed?


> > tcp_retransmit_timer() notices !tp->packets_out is true, so it short
> > circuits and returns without setting another RTO timer or checking to
> > see if the socket should be deleted. So the tcp_sock is now sitting in
> > memory with no timer set to delete it. So we could leak a socket this
> > way. So AFAICT to fix this socket leak problem, perhaps we want a
> > patch like the following (not tested yet), so that we delete all
> > killed sockets immediately, whether they are SOCK_DEAD (orphans for
> > which the user already called close() or not) :
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 28cf19317b6c2..a266078b8ec8c 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -5563,15 +5563,12 @@ int tcp_abort(struct sock *sk, int err)
> >         local_bh_disable();
> >         bh_lock_sock(sk);
> >
> > -       if (!sock_flag(sk, SOCK_DEAD)) {
> > -               if (tcp_need_reset(sk->sk_state))
> > -                       tcp_send_active_reset(sk, GFP_ATOMIC);
> > -               tcp_done_with_error(sk, err);
> > -       }
> > +       if (tcp_need_reset(sk->sk_state))
> > +               tcp_send_active_reset(sk, GFP_ATOMIC);
> > +       tcp_done_with_error(sk, err);
> >
> >         bh_unlock_sock(sk);
> >         local_bh_enable();
> > -       tcp_write_queue_purge(sk);
> >         release_sock(sk);
> >         return 0;
> >  }
> > ---
>
> Actually, it seems like a similar fix was already merged into Linux v6.11:
>
> bac76cf89816b tcp: fix forever orphan socket caused by tcp_abort
>
> Details below.
>
> Youngmin, does your kernel have this bac76cf89816b fix? If not, can
> you please cherry-pick this fix and retest?
>
> Thanks!
> neal
>
> ps: details for bac76cf89816b:
>
> commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4
> Author: Xueming Feng <kuro@...oa.me>
> Date:   Mon Aug 26 18:23:27 2024 +0800
>
>     tcp: fix forever orphan socket caused by tcp_abort
>
>     We have some problem closing zero-window fin-wait-1 tcp sockets in our
>     environment. This patch come from the investigation.
>
>     Previously tcp_abort only sends out reset and calls tcp_done when the
>     socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
>     purging the write queue, but not close the socket and left it to the
>     timer.
>
>     While purging the write queue, tp->packets_out and sk->sk_write_queue
>     is cleared along the way. However tcp_retransmit_timer have early
>     return based on !tp->packets_out and tcp_probe_timer have early
>     return based on !sk->sk_write_queue.
>
>     This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
>     and socket not being killed by the timers, converting a zero-windowed
>     orphan into a forever orphan.
>
>     This patch removes the SOCK_DEAD check in tcp_abort, making it send
>     reset to peer and close the socket accordingly. Preventing the
>     timer-less orphan from happening.
>
>     According to Lorenzo's email in the v1 thread, the check was there to
>     prevent force-closing the same socket twice. That situation is handled
>     by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
>     already closed.
>
>     The -ENOENT code comes from the associate patch Lorenzo made for
>     iproute2-ss; link attached below, which also conform to RFC 9293.
>
>     At the end of the patch, tcp_write_queue_purge(sk) is removed because it
>     was already called in tcp_done_with_error().
>
>     p.s. This is the same patch with v2. Resent due to mis-labeled "changes
>     requested" on patchwork.kernel.org.
>
>     Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
>     Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
>     Signed-off-by: Xueming Feng <kuro@...oa.me>
>     Tested-by: Lorenzo Colitti <lorenzo@...gle.com>
>     Reviewed-by: Jason Xing <kerneljasonxing@...il.com>
>     Reviewed-by: Eric Dumazet <edumazet@...gle.com>
>     Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
>     Signed-off-by: Jakub Kicinski <kuba@...nel.org>
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e03a342c9162b..831a18dc7aa6d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
>                 /* Don't race with userspace socket closes such as tcp_close. */
>                 lock_sock(sk);
>
> +       /* Avoid closing the same socket twice. */
> +       if (sk->sk_state == TCP_CLOSE) {
> +               if (!has_current_bpf_ctx())
> +                       release_sock(sk);
> +               return -ENOENT;
> +       }
> +
>         if (sk->sk_state == TCP_LISTEN) {
>                 tcp_set_state(sk, TCP_CLOSE);
>                 inet_csk_listen_stop(sk);
> @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
>         local_bh_disable();
>         bh_lock_sock(sk);
>
> -       if (!sock_flag(sk, SOCK_DEAD)) {
> -               if (tcp_need_reset(sk->sk_state))
> -                       tcp_send_active_reset(sk, GFP_ATOMIC,
> -                                             SK_RST_REASON_NOT_SPECIFIED);
> -               tcp_done_with_error(sk, err);
> -       }
> +       if (tcp_need_reset(sk->sk_state))
> +               tcp_send_active_reset(sk, GFP_ATOMIC,
> +                                     SK_RST_REASON_NOT_SPECIFIED);
> +       tcp_done_with_error(sk, err);
>
>         bh_unlock_sock(sk);
>         local_bh_enable();
> -       tcp_write_queue_purge(sk);
>         if (!has_current_bpf_ctx())
>                 release_sock(sk);
>         return 0;