Message-Id: <20240806100243.269219-1-kuro@kuroa.me>
Date: Tue, 6 Aug 2024 18:02:43 +0800
From: Xueming Feng <kuro@...oa.me>
To: "David S . Miller" <davem@...emloft.net>,
netdev@...r.kernel.org,
Eric Dumazet <edumazet@...gle.com>,
Lorenzo Colitti <lorenzo@...gle.com>,
Jason Xing <kerneljasonxing@...il.com>
Cc: Neal Cardwell <ncardwell@...gle.com>,
Yuchung Cheng <ycheng@...gle.com>,
Soheil Hassas Yeganeh <soheil@...gle.com>,
David Ahern <dsahern@...nel.org>,
linux-kernel@...r.kernel.org,
Xueming Feng <kuro@...oa.me>
Subject: Re: [PATCH net] tcp: fix forever orphan socket caused by tcp_abort
On Mon, Aug 5, 2024 at 7:14 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> On Mon, Aug 5, 2024 at 6:34 PM Xueming Feng <kuro@...oa.me> wrote:
> >
> > On Mon, Aug 5, 2024 at 4:04 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > >
> > > On Mon, Aug 5, 2024 at 3:23 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > >
> > > > On Mon, Aug 5, 2024 at 6:52 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > >
> > > > > On Sat, Aug 3, 2024 at 11:48 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > >
> > > > > > Hello Eric,
> > > > > >
> > > > > > On Thu, Aug 1, 2024 at 9:17 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > > >
> > > > > > > On Thu, Aug 1, 2024 at 1:17 PM Xueming Feng <kuro@...oa.me> wrote:
> > > > > > > >
> > > > > > > > We have had problems closing zero-window FIN-WAIT-1 TCP sockets in our
> > > > > > > > environment. This patch comes from that investigation.
> > > > > > > >
> > > > > > > > Previously, tcp_abort only sent a reset and called tcp_done when the
> > > > > > > > socket was not SOCK_DEAD, i.e. not an orphan. For an orphan socket, it
> > > > > > > > only purged the write queue, leaving the socket to the timers instead
> > > > > > > > of closing it.
> > > > > > > >
> > > > > > > > While purging the write queue, tp->packets_out and sk->sk_write_queue
> > > > > > > > are cleared along the way. However, tcp_retransmit_timer returns early
> > > > > > > > when !tp->packets_out, and tcp_probe_timer returns early when
> > > > > > > > sk->sk_write_queue is empty.
> > > > > > > >
> > > > > > > > This causes ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 to never be
> > > > > > > > rescheduled and the socket to never be killed by the timers, turning a
> > > > > > > > zero-windowed orphan into a forever orphan.
> > > > > > > >
> > > > > > > > This patch removes the SOCK_DEAD check in tcp_abort, making it send a
> > > > > > > > reset to the peer and close the socket accordingly, preventing the
> > > > > > > > timer-less orphan from occurring.
> > > > > > > >
> > > > > > > > Fixes: e05836ac07c7 ("tcp: purge write queue upon aborting the connection")
> > > > > > > > Fixes: bffd168c3fc5 ("tcp: clear tp->packets_out when purging write queue")
> > > > > > > > Signed-off-by: Xueming Feng <kuro@...oa.me>
> > > > > > >
> > > > > > > This seems legit, but are you sure these two blamed commits added this bug ?
> >
> > My bad, I wasn't sure about the intent of the original commit that did not
> > handle orphan sockets when I assigned blame; I should have asked.
> >
> > > > > > >
> > > > > > > Even before them, we should have called tcp_done() right away, instead
> > > > > > > of waiting for a (possibly long) timer to complete the job.
> > > > > > >
> > > > > > > This might be important when killing millions of sockets on a busy server.
> > > > > > >
> > > > > > > CC Lorenzo
> > > > > > >
> > > > > > > Lorenzo, do you recall why your patch was testing the SOCK_DEAD flag ?
> > > > > >
> > > > > > I guess one possible reason is to avoid a double free, something
> > > > > > like the following, happening in inet_csk_destroy_sock().
> > > > > >
> > > > > > Let me assume: if we call tcp_close() first under memory pressure,
> > > > > > meaning tcp_check_oom() returns true, then __tcp_close() will call
> > > > > > inet_csk_destroy_sock(). A later tcp_abort() would then call
> > > > > > tcp_done() and free the sk again in inet_csk_destroy_sock() if it
> > > > > > did not test the SOCK_DEAD flag.
> > > > > >
> > > > >
> > > > > How about this one, which can prevent inet_csk_destroy_sock() from
> > > > > being called twice when we call destroy and close nearly at the same
> > > > > time under that circumstance:
> > > > >
> > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > index e03a342c9162..d5d3b21cc824 100644
> > > > > --- a/net/ipv4/tcp.c
> > > > > +++ b/net/ipv4/tcp.c
> > > > > @@ -4646,7 +4646,7 @@ int tcp_abort(struct sock *sk, int err)
> > > > > local_bh_disable();
> > > > > bh_lock_sock(sk);
> > > > >
> > > > > - if (!sock_flag(sk, SOCK_DEAD)) {
> > > > > + if (sk->sk_state != TCP_CLOSE) {
> > > > > if (tcp_need_reset(sk->sk_state))
> > > > > tcp_send_active_reset(sk, GFP_ATOMIC,
> > > > > SK_RST_REASON_NOT_SPECIFIED);
> > > > >
> > > > > Each time we call inet_csk_destroy_sock(), we must make sure we've
> > > > > already set the state to TCP_CLOSE. Based on this, I think we can use
> > > > > this as an indicator to avoid calling twice to destroy the socket.
> > > >
> > > > I do not think this will work.
> > > >
> > > > With this patch, a listener socket will not get an error notification.
> > >
> > > Oh, you're right.
> > >
> > > I think we can add a branch to the if or if-else statement to handle
> > > this particular case.
> > >
> > > Thanks,
> > > Jason
> >
> > Summarizing the conversation above, I've made a v2-ish patch, which
> > handles the double abort and removes the redundant tcp_write_queue_purge().
> > Please take a look; in the meanwhile, I will see if I can write a test
> > for tcp_abort. If this looks good, I will send a formal v2 patch.
> >
> > Any advice is welcome (including on how to use this mail thread; I don't
> > have much experience with it).
> >
> > Signed-off-by: Xueming Feng <kuro@...oa.me>
> > ---
> > net/ipv4/tcp.c | 15 ++++++++-------
> > 1 file changed, 8 insertions(+), 7 deletions(-)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index e03a342c9162..039a9c9301b7 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -4614,6 +4614,10 @@ int tcp_abort(struct sock *sk, int err)
> > {
> > int state = inet_sk_state_load(sk);
> >
> > + /* Avoid force-closing the same socket twice. */
> > + if (state == TCP_CLOSE) {
> > + return 0;
> > + }
>
> No need to add brackets here.
>
> And I don't think that is the correct place to test and return, because
> we need to take care of the race between tcp_abort() and __tcp_close()
> around lock_sock(). What if tcp_abort() just passes the check and then
> tcp_close() rapidly grabs the socket lock and sets the state to
> TCP_CLOSE?

Below is the patch, changed according to your advice. The check now happens
after lock_sock() and returns -ENOENT if the socket has already been closed
by someone else.
As for tests, I have some scripts that help me test the situation, but
after reading about the KUnit framework I could not find any existing
example of TCP testing. Could anyone enlighten me?
Signed-off-by: Xueming Feng <kuro@...oa.me>
---
net/ipv4/tcp.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e03a342c9162..831a18dc7aa6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
/* Don't race with userspace socket closes such as tcp_close. */
lock_sock(sk);
+ /* Avoid closing the same socket twice. */
+ if (sk->sk_state == TCP_CLOSE) {
+ if (!has_current_bpf_ctx())
+ release_sock(sk);
+ return -ENOENT;
+ }
+
if (sk->sk_state == TCP_LISTEN) {
tcp_set_state(sk, TCP_CLOSE);
inet_csk_listen_stop(sk);
@@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
local_bh_disable();
bh_lock_sock(sk);
- if (!sock_flag(sk, SOCK_DEAD)) {
- if (tcp_need_reset(sk->sk_state))
- tcp_send_active_reset(sk, GFP_ATOMIC,
- SK_RST_REASON_NOT_SPECIFIED);
- tcp_done_with_error(sk, err);
- }
+ if (tcp_need_reset(sk->sk_state))
+ tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
+ tcp_done_with_error(sk, err);
bh_unlock_sock(sk);
local_bh_enable();
- tcp_write_queue_purge(sk);
if (!has_current_bpf_ctx())
release_sock(sk);
return 0;
--