Message-Id: <20250117192859.28252-1-dzq.aishenghu0@gmail.com>
Date: Sat, 18 Jan 2025 03:28:59 +0800
From: Zhongqiu Duan <dzq.aishenghu0@...il.com>
To: netdev@...r.kernel.org
Cc: Jason Xing <kerneljasonxing@...il.com>,
Kuniyuki Iwashima <kuniyu@...zon.com>,
Zhongqiu Duan <dzq.aishenghu0@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
"David S. Miller" <davem@...emloft.net>,
David Ahern <dsahern@...nel.org>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>
Subject: [RFC PATCH] tcp: fill the one wscale sized window to trigger zero window advertising
If the rcvbuf of a slow receiver is full, incoming packets are dropped
because tcp_try_rmem_schedule() cannot schedule more memory for them.
The scaled window size is usually not MSS aligned. If the receiver
advertises a window of one wscale unit that falls in (MSS, 2*MSS), and
GSO/TSO is disabled, at least two packets are needed to fill it. But the
receiver will not ACK the first one, and it does not offer a zero window
either, since we never shrink the offered window.
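
As a rough illustration of the arithmetic above, here is a user-space
sketch (not kernel code; the helper names are made up for this example).
With a window scale of 11, one unit of the window field is 2048 bytes,
which holds one 1448-byte segment but not two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes represented by one unit of the window field at a given scale. */
static uint32_t wscale_unit(uint8_t wscale)
{
	return (uint32_t)1 << wscale;
}

/* True when a one-unit window holds one full MSS but not two. */
static bool one_unit_window_straddles_mss(uint8_t wscale, uint32_t mss)
{
	uint32_t unit = wscale_unit(wscale);

	return unit > mss && unit < 2 * mss;
}
```

For wscale 11 and MSS 1448 the predicate holds; for wscale 7 (a 128-byte
unit) it does not.
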
The sender waits for the ACK because the remaining send window is too
small for another MSS sized packet, so tcp_snd_wnd_test() returns false.
It then arms the TLP timer, and later the retransmission timer, for the
first packet until it is ACKed.
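
The check that blocks the second packet can be modelled in user space
roughly as follows. This is a simplified sketch of the unpatched
tcp_snd_wnd_test() logic with hypothetical names (seq_after,
snd_wnd_allows), not the kernel function itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a is after b" comparison on 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * May a segment starting at seq, of len bytes (clamped to one MSS),
 * be sent when the peer has acknowledged up to snd_una and advertises
 * snd_wnd bytes?
 */
static bool snd_wnd_allows(uint32_t snd_una, uint32_t snd_wnd,
			   uint32_t seq, uint32_t len, uint32_t mss)
{
	uint32_t end_seq = seq + (len > mss ? mss : len);

	return !seq_after(end_seq, snd_una + snd_wnd);
}
```

With the values from the trace below (snd_una 194409, snd_wnd 2048, MSS
1448), the segment starting at 194409 passes, a full segment starting at
195857 does not, and a 600-byte segment starting at 195857 would.
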
After the receiver clears the rcvbuf, it may take a long time for
transmission to resume, depending on how many retransmissions have
already backed off the timer.
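
To give a feel for the delay, here is a minimal sketch of exponential
RTO backoff. The function name is hypothetical, the 120 s cap is assumed
to be the Linux default TCP_RTO_MAX, and the 250 ms base is only
inferred from the sender's ss output in step 6 (rto:32000 with
backoff:7).

```c
#include <stdint.h>

#define RTO_MAX_MS 120000u	/* assumed cap: default TCP_RTO_MAX, 120 s */

/* RTO in milliseconds after doubling the base RTO `backoff` times. */
static uint32_t backed_off_rto_ms(uint32_t base_rto_ms, unsigned int backoff)
{
	uint64_t rto;

	if (backoff > 31)	/* keep the shift well defined */
		backoff = 31;
	rto = (uint64_t)base_rto_ms << backoff;

	return rto > RTO_MAX_MS ? RTO_MAX_MS : (uint32_t)rto;
}
```

Under these assumptions a 250 ms base with seven backoffs gives 32000 ms,
so an ACK that arrives just after a retransmission can leave the sender
idle for about half a minute.
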
This issue should be rare today because GSO/TSO is widely used,
especially after commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always
on") or commit d0d598ca86bd ("net: remove sk_route_forced_caps").
We can reproduce it by reverting commit 0a6b2a1dc2a2 and disabling
hardware GSO/TSO on the NIC with ethtool (a), or by enabling MD5SIG (b).
Force a large packet to be split and send the part that fits, so that
the window is filled and the receiver can offer a zero window if it
wants to.
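
A simplified view of how much data the split has to carry, assuming the
sender simply sends whatever still fits in the offered window. The
helper name is hypothetical, and this is not how tcp_mss_split_point()
works in detail (it also considers Nagle and segment limits).

```c
#include <stdint.h>

/*
 * Bytes of a len-byte segment starting at seq that still fit in the
 * window [snd_una, snd_una + snd_wnd). Returns 0 if nothing fits.
 */
static uint32_t window_fill_len(uint32_t snd_una, uint32_t snd_wnd,
				uint32_t seq, uint32_t len)
{
	/* Signed difference keeps the comparison wrap-safe. */
	int32_t room = (int32_t)(snd_una + snd_wnd - seq);

	if (room <= 0)
		return 0;

	return len < (uint32_t)room ? len : (uint32_t)room;
}
```

In the trace below, after the first 1448-byte segment at 194409, a
second segment at 195857 has 600 bytes of room left in the 2048-byte
window.
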
Reproduce:
1. Set a large value for net.core.rmem_max on the RECV side to get a
   large wscale value of 11, which gives a window unit of 2048 bytes,
   larger than the usual MSS of 1448.
   Optionally, set a slightly lower value for net.ipv4.tcp_rmem on the
   RECV side to trigger the problem quickly.
     sysctl net.core.rmem_max=67108864
     sysctl net.ipv4.tcp_rmem="4096 131072 262144"
2. (a) Build a customized kernel with 0a6b2a1dc2a2 reverted and disable
       GSO/TSO on the NIC on the SEND side.
   (b) Or set up an xfrm tunnel with the esp proto and the aead
       rfc4106(gcm(aes)) algo. (Namespaces and veth are okay; the
       helper xfrm.sh is at the end.)
3. Start the receiver but don't receive. (netcat-bsd with MD5SIG support)
   (a) nc -l -p 11235
   (b) nc -l -p 11235 -S
4. Send.
   (a) nc 9.9.6.110 11235 <bigdata
   (b) nc -S 9.9.7.110 11235 <bigdata
5. After tens of seconds, the receiver clears the rcvbuf. (ss -tnpOHoemi)
ESTAB 0 0 9.9.6.120:11235 9.9.6.110:48038 users:(("nc",pid=1380,fd=4)) ino:19894 sk:c cgroup:/ <-> skmem:(r0,rb262144,t0,tb46080,f266240,w0,o0,bl0,d19) ts sack cubic wscale:7,11 rto:200 rtt:1.177/0.588 ato:200 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_received:392784 segs_out:139 segs_in:295 data_segs_in:293 send 98419711bps lastsnd:125850 lastrcv:55400 lastack:22130 pacing_rate 196839416bps delivered:1 rcv_rtt:0.977 rcv_space:194408 rcv_ssthresh:2896 minrtt:1.177 snd_wnd:64256
6. The sender remains in the retransmission state. (ss -tnpOHoemi)
ESTAB 0 45104 9.9.6.110:48038 9.9.6.120:11235 users:(("nc",pid=1349,fd=3)) timer:(on,30sec,7) ino:16914 sk:8 cgroup:/ <-> skmem:(r0,rb131072,t0,tb96768,f4048,w86064,o0,bl0,d0) ts sack cubic wscale:11,7 rto:32000 backoff:7 rtt:49.988/0.047 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:1 ssthresh:14 bytes_sent:208888 bytes_retrans:13032 bytes_acked:194409 segs_out:149 segs_in:92 data_segs_out:147 send 231736bps lastsnd:1100 lastrcv:38270 lastack:34530 pacing_rate 5839704bps delivery_rate 231944bps delivered:139 busy:38270ms rwnd_limited:38180ms(99.8%) unacked:1 retrans:1/9 lost:1 dsack_dups:1 rcv_space:14480 rcv_ssthresh:64088 notsent:43656 minrtt:0.254 snd_wnd:2048
Tcpdump:
```
51:10.437 S > R: seq 1910971411, win 64240, length 0
51:10.438 R > S: seq 2609098178, ack 1910971412, win 65160, length 0
51:10.439 S > R: ack 1, win 502, length 0
51:10.439 S > R: seq 1:1449, ack 1, win 502, length 1448
51:10.439 S > R: seq 1449:2897, ack 1, win 502, length 1448
51:10.439 S > R: seq 2897:4345, ack 1, win 502, length 1448
51:10.440 R > S: ack 2897, win 31, length 0
51:10.440 S > R: seq 4345:5793, ack 1, win 502, length 1448
51:10.440 R > S: ack 4345, win 31, length 0
51:10.440 S > R: seq 5793:7241, ack 1, win 502, length 1448
51:10.440 R > S: ack 7241, win 30, length 0
<...>
51:10.485 S > R: seq 85809:87257, ack 1, win 502, length 1448
51:10.527 R > S: ack 87257, win 2, length 0
51:10.527 S > R: seq 87257:88705, ack 1, win 502, length 1448
51:10.527 S > R: seq 88705:90153, ack 1, win 502, length 1448
51:10.577 R > S: ack 90153, win 1, length 0
51:10.578 S > R: seq 90153:91601, ack 1, win 502, length 1448
51:10.627 R > S: ack 91601, win 1, length 0
<...>
51:14.077 S > R: seq 191513:192961, ack 1, win 502, length 1448
51:14.127 R > S: ack 192961, win 1, length 0
51:14.127 S > R: seq 192961:194409, ack 1, win 502, length 1448
51:14.177 R > S: ack 194409, win 1, length 0
<rcvbuf full>
51:14.177 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:14.431 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:14.691 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:15.201 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:16.241 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:18.321 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:22.401 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:30.961 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:47.601 S > R: seq 194409:195857, ack 1, win 502, length 1448
<clear rcvbuf>
51:51.504 R > S: ack 194409, win 2, length 0
<retransmission timer timeout>
52:20.242 S > R: seq 194409:195857, ack 1, win 502, length 1448
52:20.242 R > S: ack 195857, win 3, length 0
<...>
52:20.245 S > R: seq 223369:224817, ack 1, win 502, length 1448
52:20.245 R > S: ack 223369, win 30, length 0
```
File: xfrm.sh
```
if [ "$1" = "l" ]; then
	mode=tunnel
	daddr=9.9.6.110
	laddr=9.9.6.120
	xdaddr=9.9.7.110
	xladdr=9.9.7.120
	ispi=0x20
	ospi=0x10
	dev=veth0
elif [ "$1" = "r" ]; then
	mode=tunnel
	daddr=9.9.6.120
	laddr=9.9.6.110
	xdaddr=9.9.7.120
	xladdr=9.9.7.110
	ispi=0x10
	ospi=0x20
	dev=veth1
else
	echo "Usage: $0 <l|r>"
	exit 1
fi
ip xfrm state flush
ip xfrm policy flush
ip link set $dev up
ip addr add $laddr/24 dev $dev
ip link add xfrm0 type xfrm dev $dev if_id 3
ip link set xfrm0 up
ip addr add $xladdr/24 dev xfrm0
ip xfrm state add src $laddr dst $daddr spi $ospi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
ip xfrm state add src $daddr dst $laddr spi $ispi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
ip xfrm policy add dir in tmpl src $daddr dst $laddr proto esp spi $ispi mode $mode if_id 3
ip xfrm policy add dir out tmpl src $laddr dst $daddr proto esp spi $ospi mode $mode if_id 3
```
Signed-off-by: Zhongqiu Duan <dzq.aishenghu0@...il.com>
---
net/ipv4/tcp_output.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..61debda90f6d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2143,6 +2143,9 @@ static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
 {
 	u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
+	if (tp->rx_opt.snd_wscale && (1 << tp->rx_opt.snd_wscale) == tp->snd_wnd)
+		return true;
+
 	if (skb->len > cur_mss)
 		end_seq = TCP_SKB_CB(skb)->seq + cur_mss;
 
@@ -2806,7 +2809,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 
 		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
+		if (!tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
 						    cwnd_quota,
 						    nonagle);
--
2.34.1