Message-ID: <CANn89i+KDDrP4xOniv6zej4SAjd5SNwR=qfu2f66F-L2+J=ZSw@mail.gmail.com>
Date: Fri, 17 Jan 2025 20:50:43 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Zhongqiu Duan <dzq.aishenghu0@...il.com>
Cc: netdev@...r.kernel.org, Jason Xing <kerneljasonxing@...il.com>,
Kuniyuki Iwashima <kuniyu@...zon.com>, "David S. Miller" <davem@...emloft.net>,
David Ahern <dsahern@...nel.org>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>
Subject: Re: [RFC PATCH] tcp: fill the one wscale sized window to trigger zero
window advertising

On Fri, Jan 17, 2025 at 8:29 PM Zhongqiu Duan <dzq.aishenghu0@...il.com> wrote:
>
> If the rcvbuf of a slow receiver is full, the packet will be dropped
> because tcp_try_rmem_schedule() cannot schedule more memory for it.
> Usually the scaled window size is not MSS aligned. If the receiver
> advertises a window of one wscale unit that falls in (MSS, 2*MSS), and
> GSO/TSO is disabled, we need at least two packets to fill it. But the
> receiver will not ACK the first one, and also does not offer a zero
> window, since we never shrink the offered window.
> The sender waits for the ACK because the send window is not large enough
> for another MSS sized packet: tcp_snd_wnd_test() returns false, and the
> sender starts the TLP and then the retransmission timer for the first
> packet until it is ACKed.
> After the receiver clears the rcvbuf, it may take a long time to resume
> transmission, depending on how many retransmissions have already
> occurred.
>
> This issue should be rare today as GSO/TSO is a common technology,
> especially after commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always on")
> or commit d0d598ca86bd ("net: remove sk_route_forced_caps").
> We can reproduce it by reverting commit 0a6b2a1dc2a2 and disabling hw
> GSO/TSO on the NIC using ethtool (a), or by enabling MD5SIG (b).
>
> Force-split a large packet and send it to fill the window so that the
> receiver can offer a zero window if it wants to.
>
> Reproduce:
>
> 1. Set a large value for net.core.rmem_max on the RECV side to get
> a large wscale value of 11, which gives a one-unit window of 2048
> bytes, larger than the usual MSS of 1448.
> Set a slightly lower value for net.ipv4.tcp_rmem on the RECV side to
> trigger the problem quickly. (optional)
>
> sysctl net.core.rmem_max=67108864
> sysctl net.ipv4.tcp_rmem="4096 131072 262144"
>
> 2. (a) Build a custom kernel with 0a6b2a1dc2a2 reverted and disable
> GSO/TSO on the NIC on the SEND side.
> (b) Or set up an xfrm tunnel with the esp proto and the aead
> rfc4106(gcm(aes)) algo. (Namespaces and veth are okay; the helper
> xfrm.sh is at the end.)
>
> 3. Start the receiver but don't receive. (netcat-bsd with MD5SIG support)
> (a) nc -l -p 11235
> (b) nc -l -p 11235 -S
>
> 4. Send.
> (a) nc 9.9.6.110 11235 <bigdata
> (b) nc -S 9.9.7.110 11235 <bigdata
>
> 5. After tens of seconds, the receiver clears the rcvbuf. (ss -tnpOHoemi)
>
> ESTAB 0 0 9.9.6.120:11235 9.9.6.110:48038 users:(("nc",pid=1380,fd=4)) ino:19894 sk:c cgroup:/ <-> skmem:(r0,rb262144,t0,tb46080,f266240,w0,o0,bl0,d19) ts sack cubic wscale:7,11 rto:200 rtt:1.177/0.588 ato:200 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_received:392784 segs_out:139 segs_in:295 data_segs_in:293 send 98419711bps lastsnd:125850 lastrcv:55400 lastack:22130 pacing_rate 196839416bps delivered:1 rcv_rtt:0.977 rcv_space:194408 rcv_ssthresh:2896 minrtt:1.177 snd_wnd:64256
>
> 6. The sender remains in the retransmission state. (ss -tnpOHoemi)
>
> ESTAB 0 45104 9.9.6.110:48038 9.9.6.120:11235 users:(("nc",pid=1349,fd=3)) timer:(on,30sec,7) ino:16914 sk:8 cgroup:/ <-> skmem:(r0,rb131072,t0,tb96768,f4048,w86064,o0,bl0,d0) ts sack cubic wscale:11,7 rto:32000 backoff:7 rtt:49.988/0.047 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:1 ssthresh:14 bytes_sent:208888 bytes_retrans:13032 bytes_acked:194409 segs_out:149 segs_in:92 data_segs_out:147 send 231736bps lastsnd:1100 lastrcv:38270 lastack:34530 pacing_rate 5839704bps delivery_rate 231944bps delivered:139 busy:38270ms rwnd_limited:38180ms(99.8%) unacked:1 retrans:1/9 lost:1 dsack_dups:1 rcv_space:14480 rcv_ssthresh:64088 notsent:43656 minrtt:0.254 snd_wnd:2048
>
> Tcpdump:
>
> ```
> 51:10.437 S > R: seq 1910971411, win 64240, length 0
> 51:10.438 R > S: seq 2609098178, ack 1910971412, win 65160, length 0
> 51:10.439 S > R: ack 1, win 502, length 0
> 51:10.439 S > R: seq 1:1449, ack 1, win 502, length 1448
> 51:10.439 S > R: seq 1449:2897, ack 1, win 502, length 1448
> 51:10.439 S > R: seq 2897:4345, ack 1, win 502, length 1448
> 51:10.440 R > S: ack 2897, win 31, length 0
> 51:10.440 S > R: seq 4345:5793, ack 1, win 502, length 1448
> 51:10.440 R > S: ack 4345, win 31, length 0
> 51:10.440 S > R: seq 5793:7241, ack 1, win 502, length 1448
> 51:10.440 R > S: ack 7241, win 30, length 0
> <...>
> 51:10.485 S > R: seq 85809:87257, ack 1, win 502, length 1448
> 51:10.527 R > S: ack 87257, win 2, length 0
> 51:10.527 S > R: seq 87257:88705, ack 1, win 502, length 1448
> 51:10.527 S > R: seq 88705:90153, ack 1, win 502, length 1448
> 51:10.577 R > S: ack 90153, win 1, length 0
> 51:10.578 S > R: seq 90153:91601, ack 1, win 502, length 1448
> 51:10.627 R > S: ack 91601, win 1, length 0
> <...>
> 51:14.077 S > R: seq 191513:192961, ack 1, win 502, length 1448
> 51:14.127 R > S: ack 192961, win 1, length 0
> 51:14.127 S > R: seq 192961:194409, ack 1, win 502, length 1448
> 51:14.177 R > S: ack 194409, win 1, length 0
> <rcvbuf full>

I have not seen a "win 0" though...

> 51:14.177 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:14.431 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:14.691 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:15.201 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:16.241 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:18.321 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:22.401 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:30.961 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 51:47.601 S > R: seq 194409:195857, ack 1, win 502, length 1448
> <clear rcvbuf>
> 51:51.504 R > S: ack 194409, win 2, length 0
> <retransmission timer timeout>
> 52:20.242 S > R: seq 194409:195857, ack 1, win 502, length 1448
> 52:20.242 R > S: ack 195857, win 3, length 0
> <...>
> 52:20.245 S > R: seq 223369:224817, ack 1, win 502, length 1448
> 52:20.245 R > S: ack 223369, win 30, length 0
> ```
>
> File: xfrm.sh
>
> ```
> if [ "$1" = "l" ]; then
>     mode=tunnel
>     daddr=9.9.6.110
>     laddr=9.9.6.120
>     xdaddr=9.9.7.110
>     xladdr=9.9.7.120
>     ispi=0x20
>     ospi=0x10
>     dev=veth0
> elif [ "$1" = "r" ]; then
>     mode=tunnel
>     daddr=9.9.6.120
>     laddr=9.9.6.110
>     xdaddr=9.9.7.120
>     xladdr=9.9.7.110
>     ispi=0x10
>     ospi=0x20
>     dev=veth1
> else
>     echo "Usage: $0 <l|r>"
>     exit 1
> fi
>
> ip xfrm state flush
> ip xfrm policy flush
> ip link set $dev up
> ip addr add $laddr/24 dev $dev
> ip link add xfrm0 type xfrm dev $dev if_id 3
> ip link set xfrm0 up
> ip addr add $xladdr/24 dev xfrm0
> ip xfrm state add src $laddr dst $daddr spi $ospi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
> ip xfrm state add src $daddr dst $laddr spi $ispi proto esp mode $mode if_id 3 aead 'rfc4106(gcm(aes))' 0x856f15d0ccabe682286b4286bccf5d595b88b168 128
> ip xfrm policy add dir in tmpl src $daddr dst $laddr proto esp spi $ispi mode $mode if_id 3
> ip xfrm policy add dir out tmpl src $laddr dst $daddr proto esp spi $ospi mode $mode if_id 3
> ```
>
> Signed-off-by: Zhongqiu Duan <dzq.aishenghu0@...il.com>
> ---
> net/ipv4/tcp_output.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0e5b9a654254..61debda90f6d 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2143,6 +2143,9 @@ static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
>  {
>  	u32 end_seq = TCP_SKB_CB(skb)->end_seq;
>
> +	if (tp->rx_opt.snd_wscale && (1 << tp->rx_opt.snd_wscale) == tp->snd_wnd)
> +		return true;

This is not generic.

What if tp->snd_wnd == (2 << tp->rx_opt.snd_wscale), for wscale == 10 ?

> +
>  	if (skb->len > cur_mss)
>  		end_seq = TCP_SKB_CB(skb)->seq + cur_mss;
>
> @@ -2806,7 +2809,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  		}
>
>  		limit = mss_now;
> -		if (tso_segs > 1 && !tcp_urg_mode(tp))
> +		if (!tcp_urg_mode(tp))
>  			limit = tcp_mss_split_point(sk, skb, mss_now,
>  						    cwnd_quota,
>  						    nonagle);
> --
> 2.34.1

I think you are trying to solve the issue at the sender side, in the
fast path, adding lots of cycles.

While the issue seems to be a receive side one, failing to send a "win
0" at the right time/conditions.

If the last ACK had a "win 1", I fail to see why a packet with length
<= 2048 can not be received.
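
For readers following the arithmetic in this thread, here is a minimal
userspace sketch of the numbers involved (wscale 11, "win 1", MSS 1448).
It is not kernel code: the helper names offered_window(), full_mss_fits()
and rfc_patch_special_case() are hypothetical, and full_mss_fits() is only
a simplified model of what tcp_snd_wnd_test() decides, assuming the window
starts at the first unacknowledged byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes offered by a raw 16-bit window field under a given window scale.
 * RFC 7323 limits the shift count to 14.
 */
static inline uint32_t offered_window(uint16_t win_field, uint8_t wscale)
{
	assert(wscale <= 14);
	return (uint32_t)win_field << wscale;
}

/* Simplified model (an assumption, not the kernel implementation):
 * one more full-MSS segment fits only if it ends at or before the
 * right edge of the send window.
 */
static inline bool full_mss_fits(uint32_t snd_wnd, uint32_t in_flight,
				 uint32_t mss)
{
	return in_flight + mss <= snd_wnd;
}

/* The extra condition proposed in the RFC patch, reproduced for comparison. */
static inline bool rfc_patch_special_case(uint32_t snd_wnd, uint8_t wscale)
{
	return wscale && ((uint32_t)1 << wscale) == snd_wnd;
}
```

With the values from the trace, offered_window(1, 11) is 2048 bytes, so
one 1448-byte segment fits, but a second one does not because only 600
bytes remain. offered_window(2, 10) is also 2048 bytes, and in that case
rfc_patch_special_case(2048, 10) is false, which is the "not generic"
case raised above.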