Message-Id: <20250117192859.28252-1-dzq.aishenghu0@gmail.com>
Date: Sat, 18 Jan 2025 03:28:59 +0800
From: Zhongqiu Duan <dzq.aishenghu0@...il.com>
To: netdev@...r.kernel.org
Cc: Jason Xing <kerneljasonxing@...il.com>,
	Kuniyuki Iwashima <kuniyu@...zon.com>,
	Zhongqiu Duan <dzq.aishenghu0@...il.com>,
	Eric Dumazet <edumazet@...gle.com>,
	"David S. Miller" <davem@...emloft.net>,
	David Ahern <dsahern@...nel.org>,
	Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>,
	Simon Horman <horms@...nel.org>
Subject: [RFC PATCH] tcp: fill the one wscale sized window to trigger zero window advertising

If the rcvbuf of a slow receiver is full, incoming packets are dropped
because tcp_try_rmem_schedule() cannot schedule more memory for them.
Usually the scaled window size is not MSS aligned. If the receiver
advertises a window of exactly one wscale unit (1 << wscale) that lies in
(MSS, 2*MSS), and GSO/TSO is disabled, at least two packets are needed to
fill it. But the receiver will not ACK the first one, and it does not
offer a zero window either, since we never shrink the offered window.
The sender cannot send another MSS-sized packet because the remaining
send window is too small, so tcp_snd_wnd_test() returns false. It waits
for the ACK and starts the TLP and then the retransmission timer for the
first packet until it is ACKed.
After the receiver clears the rcvbuf, it may take a long time for
transmission to resume, depending on how many retransmissions have
already happened, because the RTO backs off with each one.
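
The stall described above can be illustrated with a small standalone
model of the unpatched window check. This is only a sketch: the function
name snd_wnd_test_model() and its flat parameter list are made up for
illustration, and the real tcp_snd_wnd_test() works on struct tcp_sock
and an skb.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified, hypothetical model of the unpatched send-window check.
 *   snd_una:  first unacknowledged sequence number
 *   snd_wnd:  window advertised by the receiver, in bytes
 *   seq, len: start and length of the segment we want to send
 *   cur_mss:  current MSS
 * Only the first MSS of an oversized segment has to fit in the window.
 */
bool snd_wnd_test_model(uint32_t snd_una, uint32_t snd_wnd,
			uint32_t seq, uint32_t len, uint32_t cur_mss)
{
	uint32_t end_seq = seq + (len > cur_mss ? cur_mss : len);
	uint32_t wnd_end = snd_una + snd_wnd;

	/* Wrap-safe "end_seq <= wnd_end" on 32-bit sequence numbers. */
	return (int32_t)(wnd_end - end_seq) >= 0;
}
```

With snd_wnd = 2048 and cur_mss = 1448, the first segment (relative seq 0)
passes the check, but the second one (relative seq 1448) would end at 2896
and is rejected, so the sender sits on 600 bytes of unused window.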

This issue should be rare today because GSO/TSO is almost always in use,
especially after commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always on")
or commit d0d598ca86bd ("net: remove sk_route_forced_caps").
It can be reproduced either (a) by reverting commit 0a6b2a1dc2a2 and
disabling hardware GSO/TSO on the NIC with ethtool, or (b) by enabling
MD5SIG.

Force a large packet to be split and send the part that fits, so that the
window is filled and the receiver can offer a zero window if it wants to.
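
As a rough illustration of the idea, the following standalone sketch
models the two pieces involved: detecting a window of exactly one wscale
unit, and computing how many bytes still fit. Both helper names are
hypothetical and not kernel functions; in the kernel the split length
comes from tcp_mss_split_point(), which also applies Nagle and cwnd
limits that are ignored here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when the peer advertises exactly one
 * window-scale unit, the smallest nonzero window it can express.
 * snd_wscale is assumed to be at most 14.
 */
bool is_one_wscale_window(uint8_t snd_wscale, uint32_t snd_wnd)
{
	return snd_wscale && (UINT32_C(1) << snd_wscale) == snd_wnd;
}

/*
 * Hypothetical helper: number of bytes of a segment starting at seq that
 * can be sent so that in-flight data exactly fills the offered window.
 * Assumes seq lies inside [snd_una, snd_una + snd_wnd].
 */
uint32_t bytes_to_fill_window(uint32_t snd_una, uint32_t snd_wnd,
			      uint32_t seq, uint32_t len)
{
	uint32_t room = snd_una + snd_wnd - seq;

	return len < room ? len : room;
}
```

For a 2048-byte window with one 1448-byte segment already in flight, the
sketch yields 600 bytes for the next, split segment.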

Reproduce:

1. Set a large value for net.core.rmem_max on the RECV side to get a
   large wscale value of 11. The smallest nonzero window is then 2048
   bytes (1 << 11), which is larger than the usual MSS of 1448.
   Optionally, set a slightly lower value for net.ipv4.tcp_rmem on the
   RECV side to trigger the problem quickly.

   sysctl net.core.rmem_max=67108864
   sysctl net.ipv4.tcp_rmem="4096 131072 262144"

2. (a) Build a custom kernel with 0a6b2a1dc2a2 reverted and disable
   GSO/TSO on the NIC on the SEND side.
   (b) Or set up an xfrm tunnel with the esp proto and the aead
   rfc4106(gcm(aes)) algo. (Namespaces and veth are okay; the helper
   xfrm.sh is at the end.)

3. Start the receiver but don't receive. (netcat-bsd with MD5SIG support)
   (a) nc -l -p 11235
   (b) nc -l -p 11235 -S

4. Send.
   (a) nc 9.9.6.110 11235 <bigdata
   (b) nc -S 9.9.7.110 11235 <bigdata

5. After tens of seconds, the receiver clears the rcvbuf. (ss -tnpOHoemi)

ESTAB 0      0      9.9.6.120:11235 9.9.6.110:48038 users:(("nc",pid=1380,fd=4)) ino:19894 sk:c cgroup:/ <-> skmem:(r0,rb262144,t0,tb46080,f266240,w0,o0,bl0,d19) ts sack cubic wscale:7,11 rto:200 rtt:1.177/0.588 ato:200 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_received:392784 segs_out:139 segs_in:295 data_segs_in:293 send 98419711bps lastsnd:125850 lastrcv:55400 lastack:22130 pacing_rate 196839416bps delivered:1 rcv_rtt:0.977 rcv_space:194408 rcv_ssthresh:2896 minrtt:1.177 snd_wnd:64256

6. The sender remains in the retransmission state. (ss -tnpOHoemi)

ESTAB 0      45104  9.9.6.110:48038 9.9.6.120:11235 users:(("nc",pid=1349,fd=3)) timer:(on,30sec,7) ino:16914 sk:8 cgroup:/ <-> skmem:(r0,rb131072,t0,tb96768,f4048,w86064,o0,bl0,d0) ts sack cubic wscale:11,7 rto:32000 backoff:7 rtt:49.988/0.047 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:1 ssthresh:14 bytes_sent:208888 bytes_retrans:13032 bytes_acked:194409 segs_out:149 segs_in:92 data_segs_out:147 send 231736bps lastsnd:1100 lastrcv:38270 lastack:34530 pacing_rate 5839704bps delivery_rate 231944bps delivered:139 busy:38270ms rwnd_limited:38180ms(99.8%) unacked:1 retrans:1/9 lost:1 dsack_dups:1 rcv_space:14480 rcv_ssthresh:64088 notsent:43656 minrtt:0.254 snd_wnd:2048
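
The wscale value of 11 from step 1 and the snd_wnd:2048 seen on the sender
can be related with a little arithmetic. The sketch below assumes the
receive scale is derived roughly as ilog2(space) - 15, clamped to
[0, 14], where space is the largest receive buffer allowed; the function
names are made up for illustration and any window clamp is ignored.

```c
#include <stdint.h>

#define MODEL_TCP_MAX_WSCALE 14	/* upper bound on the TCP window scale */

/* Integer log2; returns -1 for 0. */
int ilog2_u32(uint32_t v)
{
	int n = -1;

	while (v) {
		v >>= 1;
		n++;
	}
	return n;
}

/* Assumed model: scale = ilog2(space) - 15, clamped to [0, 14]. */
int rcv_wscale_model(uint32_t space)
{
	int ws = ilog2_u32(space) - 15;

	if (ws < 0)
		ws = 0;
	if (ws > MODEL_TCP_MAX_WSCALE)
		ws = MODEL_TCP_MAX_WSCALE;
	return ws;
}

/* Smallest nonzero window that can be expressed with a given scale. */
uint32_t min_nonzero_window(int wscale)
{
	return UINT32_C(1) << wscale;
}
```

Under this model, rmem_max = 67108864 gives a scale of 11 and a smallest
nonzero window of 2048 bytes, which lies between one and two 1448-byte
segments.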

Tcpdump:

```
51:10.437 S > R: seq 1910971411, win 64240, length 0
51:10.438 R > S: seq 2609098178, ack 1910971412, win 65160, length 0
51:10.439 S > R: ack 1, win 502, length 0
51:10.439 S > R: seq 1:1449, ack 1, win 502, length 1448
51:10.439 S > R: seq 1449:2897, ack 1, win 502, length 1448
51:10.439 S > R: seq 2897:4345, ack 1, win 502, length 1448
51:10.440 R > S: ack 2897, win 31, length 0
51:10.440 S > R: seq 4345:5793, ack 1, win 502, length 1448
51:10.440 R > S: ack 4345, win 31, length 0
51:10.440 S > R: seq 5793:7241, ack 1, win 502, length 1448
51:10.440 R > S: ack 7241, win 30, length 0
<...>
51:10.485 S > R: seq 85809:87257, ack 1, win 502, length 1448
51:10.527 R > S: ack 87257, win 2, length 0
51:10.527 S > R: seq 87257:88705, ack 1, win 502, length 1448
51:10.527 S > R: seq 88705:90153, ack 1, win 502, length 1448
51:10.577 R > S: ack 90153, win 1, length 0
51:10.578 S > R: seq 90153:91601, ack 1, win 502, length 1448
51:10.627 R > S: ack 91601, win 1, length 0
<...>
51:14.077 S > R: seq 191513:192961, ack 1, win 502, length 1448
51:14.127 R > S: ack 192961, win 1, length 0
51:14.127 S > R: seq 192961:194409, ack 1, win 502, length 1448
51:14.177 R > S: ack 194409, win 1, length 0
<rcvbuf full>
51:14.177 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:14.431 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:14.691 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:15.201 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:16.241 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:18.321 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:22.401 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:30.961 S > R: seq 194409:195857, ack 1, win 502, length 1448
51:47.601 S > R: seq 194409:195857, ack 1, win 502, length 1448
<clear rcvbuf>
51:51.504 R > S: ack 194409, win 2, length 0
<retransmission timer timeout>
52:20.242 S > R: seq 194409:195857, ack 1, win 502, length 1448
52:20.242 R > S: ack 195857, win 3, length 0
<...>
52:20.245 S > R: seq 223369:224817, ack 1, win 502, length 1448
52:20.245 R > S: ack 223369, win 30, length 0
```
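
The growing gaps between the retransmissions in the trace follow from
exponential RTO backoff. The sketch below models that with a hypothetical
helper; the 120 s cap and the 250 ms base are assumptions, the latter
chosen so that the result matches the "rto:32000 backoff:7" shown in the
sender's ss output.

```c
#include <stdint.h>

#define MODEL_RTO_MAX_MS 120000u	/* assumed upper bound on the RTO, 120 s */

/*
 * Hypothetical helper: RTO after `backoff` doublings of base_rto_ms,
 * capped at MODEL_RTO_MAX_MS. Assumes backoff < 32.
 */
uint32_t backed_off_rto_ms(uint32_t base_rto_ms, unsigned int backoff)
{
	uint64_t rto = (uint64_t)base_rto_ms << backoff;

	return rto > MODEL_RTO_MAX_MS ? MODEL_RTO_MAX_MS : (uint32_t)rto;
}
```

With a 250 ms base and a backoff of 7 the model gives 32000 ms, which is
about the gap between the retransmissions at 51:47.601 and 52:20.242. The
window update at 51:51.504 arrives early in that interval, so the sender
stays idle for almost 29 seconds before it resumes.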

File: xfrm.sh

```
if [ "$1" = "l" ]; then
        mode=tunnel
        daddr=9.9.6.110
        laddr=9.9.6.120
        xdaddr=9.9.7.110
        xladdr=9.9.7.120
        ispi=0x20
        ospi=0x10
        dev=veth0
elif [ "$1" = "r" ]; then
        mode=tunnel
        daddr=9.9.6.120
        laddr=9.9.6.110
        xdaddr=9.9.7.120
        xladdr=9.9.7.110
        ispi=0x10
        ospi=0x20
        dev=veth1
else
        echo "Usage: $0 <l|r>"
        exit 1
fi

ip xfrm state flush
ip xfrm policy flush
ip link set $dev up
ip addr add $laddr/24 dev $dev
ip link add xfrm0 type xfrm dev $dev if_id 3
ip link set xfrm0 up
ip addr add $xladdr/24 dev xfrm0
ip xfrm state add src $laddr dst $daddr spi $ospi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
ip xfrm state add src $daddr dst $laddr spi $ispi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
ip xfrm policy add dir in  tmpl src $daddr dst $laddr proto esp spi $ispi mode $mode if_id 3
ip xfrm policy add dir out tmpl src $laddr dst $daddr proto esp spi $ospi mode $mode if_id 3
```

Signed-off-by: Zhongqiu Duan <dzq.aishenghu0@...il.com>
---
 net/ipv4/tcp_output.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..61debda90f6d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2143,6 +2143,9 @@ static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
 {
 	u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
+	if (tp->rx_opt.snd_wscale && (1 << tp->rx_opt.snd_wscale) == tp->snd_wnd)
+		return true;
+
 	if (skb->len > cur_mss)
 		end_seq = TCP_SKB_CB(skb)->seq + cur_mss;
 
@@ -2806,7 +2809,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 
 		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
+		if (!tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
 						    cwnd_quota,
 						    nonagle);
-- 
2.34.1

