Message-ID: <20250822060254.74708-1-mii.w@linux.alibaba.com>
Date: Fri, 22 Aug 2025 14:02:54 +0800
From: 'MingMing Wang' <mii.w@...ux.alibaba.com>
To: edumazet@...gle.com,
ncardwell@...gle.com,
kuniyu@...gle.com,
davem@...emloft.net,
dsahern@...nel.org,
kuba@...nel.org,
pabeni@...hat.com,
horms@...nel.org,
ycheng@...gle.com
Cc: netdev@...r.kernel.org,
linux-kernel@...r.kernel.org,
MingMing Wang <mii.w@...ux.alibaba.com>,
Dust Li <dust.li@...ux.alibaba.com>
Subject: [RFC net] tcp: Fix orphaned socket stalling indefinitely in FIN-WAIT-1
From: MingMing Wang <mii.w@...ux.alibaba.com>
An orphaned TCP socket can stall indefinitely in FIN-WAIT-1
if the following conditions are met:
1. net.ipv4.tcp_retries2 is set to a value ≤ 8;
2. The peer advertises a zero window, and the window never reopens.
Steps to reproduce:
1. Set up two instances with nmap installed: one will act as the server,
the other as the client
2. Execute on the server:
a. lower rmem: `sysctl -w net.ipv4.tcp_rmem="16 32 32"`
b. start a listener: `nc -l -p 1234`
3. Execute on the client:
a. lower tcp_retries2: `sysctl -w net.ipv4.tcp_retries2=8`
b. send packets: `cat /dev/zero | nc <server-ip> 1234`
c. after five seconds, stop the process: `killall nc`
4. Execute on the server: `killall -STOP nc`
5. Expected abnormal result: using the `ss` command, we'll notice that
the client connection remains stuck in the FIN-WAIT-1 state, and the
backoff counter stays at 8 and never increases, as shown below:
```
FIN-WAIT-1 0 1389 172.16.0.2:50316 172.16.0.1:1234
cubic wscale:2,7 rto:201 backoff:8 rtt:0.078/0.007 mss:36
... other fields omitted ...
```
6. If we set tcp_retries2 to 15 and repeat the steps above, the
connection in FIN-WAIT-1 is forcefully reclaimed after about 5 minutes.
During the zero-window probe retry process, the kernel checks whether
the connection is still alive. If the connection is not alive and the
retry counter exceeds the maximum allowed `max_probes`, the retry
process is terminated, as the condensed excerpt below shows.
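For context, that check lives in tcp_probe_timer(). Condensed from the
unpatched net/ipv4/tcp_timer.c (matching the hunk at the end of this
patch; unrelated logic omitted):
```
	if (sock_flag(sk, SOCK_DEAD)) {
		unsigned int rto_max = tcp_rto_max(sk);
		/* alive: the backed-off RTO has not yet reached rto_max */
		const bool alive = inet_csk_rto_backoff(icsk, rto_max) < rto_max;

		max_probes = tcp_orphan_retries(sk, alive);
		if (!alive && icsk->icsk_backoff >= max_probes)
			goto abort;
		if (tcp_out_of_resources(sk, true))
			return;
	}
```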
In our case, when `net.ipv4.tcp_retries2` is set to 8 or a smaller
value, the current implementation caps the `icsk->icsk_backoff`
counter at `net.ipv4.tcp_retries2`. The value calculated by
`inet_csk_rto_backoff()` is then always too small, meaning the
computed backoff duration is always less than rto_max. As a result,
the alive check always returns true. Since the condition guarding the
`goto abort` statement is a logical AND, the abort branch can never
be reached.
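The arithmetic can be checked outside the kernel. As of recent kernels,
inet_csk_rto_backoff() essentially returns
min((u64)icsk_rto << icsk_backoff, max_when); the standalone userspace
sketch below is an approximation built on that paraphrase (not kernel
code), plugging in the values from the ss output above: rto of about
201 ms, backoff capped at 8, and the default rto_max of 120 s.
```
/* Userspace approximation of the alive check in tcp_probe_timer().
 * rto_backoff() models inet_csk_rto_backoff(), i.e.
 * min((u64)icsk_rto << icsk_backoff, max_when). The constants come
 * from the ss output in the reproduction above, not a live kernel.
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t rto_backoff(uint64_t rto_ms, unsigned int backoff,
			    uint64_t max_when_ms)
{
	uint64_t when = rto_ms << backoff;

	return when < max_when_ms ? when : max_when_ms;
}

int main(void)
{
	const uint64_t rto_ms = 201;	    /* rto:201 from ss */
	const uint64_t rto_max_ms = 120000; /* default TCP_RTO_MAX: 120 s */
	unsigned int backoff;

	/* With tcp_retries2=8, icsk_backoff never grows past 8. */
	for (backoff = 0; backoff <= 8; backoff++)
		printf("backoff=%u alive=%d\n", backoff,
		       rto_backoff(rto_ms, backoff, rto_max_ms) < rto_max_ms);
	/* 201 ms << 8 is only ~51 s, well below 120 s, so every line
	 * prints alive=1: the "!alive && ..." test can never abort.
	 */
	return 0;
}
```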
So, the TCP retransmission backoff mechanism has two issues:
1. `icsk->icsk_backoff` should monotonically increase during probe
transmission and, upon reaching the maximum backoff limit, the
connection should be terminated. However, the backoff value itself
must not be capped prematurely; it should only control when to abort.
2. The condition for aborting an orphaned connection was incorrectly
based on connection liveness and probe count together. It should
instead consider whether the number of orphan probes exceeds the
intended limit (see the continuation of the sketch below).
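Reusing rto_backoff() and the constants from the sketch above, letting
the backoff keep growing shows why removing the cap is enough for the
abort path to fire (the exact backoff at which alive flips depends on
the measured RTO):
```
	/* Continuation of the sketch above: once icsk_backoff is no
	 * longer capped, alive eventually becomes false and the patched
	 * "!alive || ..." condition takes the abort path.
	 */
	for (backoff = 9; backoff <= 11; backoff++) {
		int alive = rto_backoff(rto_ms, backoff, rto_max_ms) < rto_max_ms;

		printf("backoff=%u alive=%d%s\n", backoff, alive,
		       alive ? "" : " -> abort");
	}
	/* 201 ms << 10 = ~206 s >= 120 s: alive is false from backoff 10. */
```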
To fix this, introduce a local variable `orphan_probes` to track orphan
probe attempts separately from `max_probes`, which is used for RTO
retransmissions. This decouples the two counters and prevents accidental
overwrites, ensuring correct timeout behavior for orphaned connections.
Fixes: b248230c34970 ("tcp: abort orphan sockets stalling on zero window probes")
Co-developed-by: Dust Li <dust.li@...ux.alibaba.com>
Signed-off-by: Dust Li <dust.li@...ux.alibaba.com>
Co-developed-by: MingMing Wang <mii.w@...ux.alibaba.com>
Signed-off-by: MingMing Wang <mii.w@...ux.alibaba.com>
---
We couldn't determine the rationale behind the following check in tcp_send_probe0():
```
if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
	icsk->icsk_backoff++;
```
This condition appears to be the root cause of the observed stall.
However, it has existed in the kernel for over 20 years, which suggests
there might be a historical or subtle reason for its presence.
We would greatly appreciate it if anyone could shed some light on this.
---
net/ipv4/tcp_output.c | 4 +---
net/ipv4/tcp_timer.c | 4 ++--
2 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index caf11920a878..21795d696e38 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4385,7 +4385,6 @@ void tcp_send_probe0(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct net *net = sock_net(sk);
 	unsigned long timeout;
 	int err;
 
@@ -4401,8 +4400,7 @@ void tcp_send_probe0(struct sock *sk)
 
 	icsk->icsk_probes_out++;
 	if (err <= 0) {
-		if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
-			icsk->icsk_backoff++;
+		icsk->icsk_backoff++;
 		timeout = tcp_probe0_when(sk, tcp_rto_max(sk));
 	} else {
 		/* If packet was not sent due to local congestion,
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index a207877270fb..4dba2928e1bf 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -419,9 +419,9 @@ static void tcp_probe_timer(struct sock *sk)
 	if (sock_flag(sk, SOCK_DEAD)) {
 		unsigned int rto_max = tcp_rto_max(sk);
 		const bool alive = inet_csk_rto_backoff(icsk, rto_max) < rto_max;
+		int orphan_probes = tcp_orphan_retries(sk, alive);
 
-		max_probes = tcp_orphan_retries(sk, alive);
-		if (!alive && icsk->icsk_backoff >= max_probes)
+		if (!alive || icsk->icsk_backoff >= orphan_probes)
 			goto abort;
 		if (tcp_out_of_resources(sk, true))
 			return;
--
2.46.0