Message-ID: <60b04b0a-a50e-4d4a-a2bf-ea420f428b9c@quicinc.com>
Date: Wed, 15 May 2024 20:32:27 -0600
From: "Subash Abhinov Kasiviswanathan (KS)" <quic_subashab@...cinc.com>
To: Eric Dumazet <edumazet@...gle.com>
CC: <soheil@...gle.com>, <ncardwell@...gle.com>, <yyd@...gle.com>,
<ycheng@...gle.com>, <quic_stranche@...cinc.com>,
<davem@...emloft.net>, <kuba@...nel.org>, <netdev@...r.kernel.org>
Subject: Re: Potential impact of commit dfa2f0483360 ("tcp: get rid of
sysctl_tcp_adv_win_scale")
On 5/15/2024 1:10 AM, Eric Dumazet wrote:
> On Wed, May 15, 2024 at 6:47 AM Subash Abhinov Kasiviswanathan (KS)
> <quic_subashab@...cinc.com> wrote:
>>
>> We recently noticed that a device running a 6.6.17 kernel (A) had a
>> slower single-stream download speed than a device running a 6.1.57
>> kernel (B). The test here is iperf3 over a mobile radio link, with a
>> 4M window size, from a third-party server.
>
> Hi Subash
>
> I think you gave many details, but please give us more of them:
Hi Eric
Thanks for getting back. Hope the information below is useful.
>
> 1) What driver is used on the receiver side.
rmnet
> 2) MTU
1372
> 3) cat /proc/sys/net/ipv4/tcp_rmem
4096 6291456 16777216
>
> Ideally, you could snapshot "ss -temoi dst <otherpeer>" on receive
> side while the transfer is ongoing,
> and possibly while stopping the receiver thread (kill -STOP `pidof iperf`)
>
192.0.0.2 is the device-side address. I've listed the output of "ss
-temoi dst 223.62.236.10" once mid-transfer and once near the end of
the transfer. iperf3 opens a control connection before the data
connection, so two flows are listed. The flow
192.0.0.2:42278 <-> 223.62.236.10:5215 is the main data connection in
this case.
//mid transfer
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 192.0.0.2:42278 223.62.236.10:5215
ino:129232 sk:3218 fwmark:0xc0078 <->
skmem:(r0,rb8388608,t0,tb8388608,f0,w0,o0,bl0,d1) ts sack
cubic wscale:7,6 rto:236 rtt:34.249/16.545 ato:40 mss:1320 rcvmss:1320
advmss:1320 cwnd:10 ssthresh:1400 bytes_acked:38
bytes_received:211495680 segs_out:46198 segs_in:160290 data_segs_out:1
data_segs_in:160287 send 3.1Mbps lastsnd:3996 pacing_rate 6.2Mbps
delivery_rate 452.4Kbps app_limited busy:24ms rcv_rtt:26.542
rcv_space:3058440 minrtt:23.34
ESTAB 0 0 192.0.0.2:42270 223.62.236.10:5215
ino:128718 sk:4273 fwmark:0xc0078 <->
skmem:(r0,rb6291456,t0,tb2097152,f0,w0,o0,bl0,d0) ts sack
cubic wscale:10,9 rto:528 rtt:144.931/93.4 ato:40 mss:1320 rcvmss:536
advmss:1320 cwnd:10 ssthresh:1400 bytes_acked:223 bytes_received:4
segs_out:9 segs_in:8 data_segs_out:3 data_segs_in:4 send 728.6Kbps
lastsnd:6064 lastrcv:3948 lastack:3948 pacing_rate 1.5Mbps delivery_rate
351.8Kbps app_limited busy:156ms rcv_space:13200 minrtt:30.021
//close to end of transfer
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 4324072 0 192.0.0.2:42278 223.62.236.10:5215
ino:129232 sk:3218 fwmark:0xc0078 <->
skmem:(r4511016,rb8388608,t0,tb8388608,f2776,w0,o0,bl0,d1) ts
sack cubic wscale:7,6 rto:236 rtt:34.249/16.545 ato:40 mss:1320
rcvmss:1320 advmss:1320 cwnd:10 ssthresh:1400 bytes_acked:38
bytes_received:608252040 segs_out:133117 segs_in:460963 data_segs_out:1
data_segs_in:460960 send 3.1Mbps lastsnd:10104 pacing_rate 6.2Mbps
delivery_rate 452.4Kbps app_limited busy:24ms rcv_rtt:25.111
rcv_space:3871560 minrtt:23.34
ESTAB 0 294 192.0.0.2:42270 223.62.236.10:5215
timer:(on,412ms,0) ino:128718 sk:4273 fwmark:0xc0078 <->
skmem:(r0,rb6291456,t0,tb2097152,f2010,w2086,o0,bl0,d0) ts
sack cubic wscale:10,9 rto:512 rtt:129.796/94.265 ato:40 mss:1320
rcvmss:536 advmss:1320 cwnd:10 ssthresh:1400 bytes_acked:224
bytes_received:5 segs_out:12 segs_in:9 data_segs_out:5 data_segs_in:5
send 813.6Kbps lastsnd:48 lastrcv:52 lastack:52 pacing_rate 1.6Mbps
delivery_rate 442.8Kbps app_limited busy:228ms unacked:1 rcv_space:13200
notsent:290 minrtt:23.848
> TCP is sensitive to the skb->len/skb->truesize ratio.
> Some drivers are known to provide 'bad skbs' in this regard.
>
> Commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") is
> simply a step toward dynamic probing of the skb->len/skb->truesize
> ratio, and gives an incentive for better memory use.
>
> Ultimately, TCP RWIN derives from effective memory usage.
>
> Sending a too-big RWIN can cause excessive memory usage or packet drops.
> If you say RWIN was 6MB+ before the patch, this looks like a bug to me,
> because tcp_rmem[2] = 6MB by default. There is no way a driver can
> pack 6MB of TCP payload in 6MB of memory (no skb/headers overhead ???)
> This would only work well on lossless networks, and only if the
> receiving application drains the TCP receive queue fast enough.
>
> Please take a look at these relevant patches.
> Note they are not perfect patches, because usbnet can still provide
> 'bad skbs', forcing TCP to send a small RWIN.
rmnet does not update the truesize directly in the receive path. There
is no cloning; the data content is explicitly copied into a freshly
allocated skb, similar to the commits you shared below.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c?h=v6.6.17#n385
From netif_receive_skb_entry tracing, I see that the truesize is around
2.5K for ~1.5K-byte packets.
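For reference, a simplified sketch of that copy-into-a-fresh-skb pattern
(illustrative only, not the actual rmnet deaggregation code, and the
function name is made up):

#include <linux/skbuff.h>

/* Copy one deaggregated packet out of the large aggregated RX buffer
 * into its own freshly allocated skb, so skb->truesize accounts only
 * for this packet's allocation.
 */
static struct sk_buff *rx_copy_packet(const void *data, unsigned int len)
{
	struct sk_buff *skb;

	/* alloc_skb() rounds the data area up to a kmalloc size class, so
	 * a ~1.5K packet lands in a 2K buffer plus struct sk_buff
	 * overhead, which is roughly the ~2.5K truesize seen in the
	 * tracing above.
	 */
	skb = alloc_skb(len, GFP_ATOMIC);
	if (!skb)
		return NULL;

	/* Plain copy; the aggregated buffer is never cloned. */
	skb_put_data(skb, data, len);

	return skb;
}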
>
> d50729f1d60bca822ef6d9c1a5fb28d486bd7593 net: usb: smsc95xx: stop
> lying about skb->truesize
> 05417aa9c0c038da2464a0c504b9d4f99814a23b net: usb: sr9700: stop lying
> about skb->truesize
> 1b3b2d9e772b99ea3d0f1f2252bf7a1c94b88be6 net: usb: smsc75xx: stop
> lying about skb->truesize
> 9aad6e45c4e7d16b2bb7c3794154b828fb4384b4 usb: aqc111: stop lying about
> skb->truesize
> 4ce62d5b2f7aecd4900e7d6115588ad7f9acccca net: usb: ax88179_178a: stop
> lying about skb->truesize
I reviewed many of the tcpdumps from other internal tests and I
consistently see the receive window scale to roughly half of what is
specified to iperf3, regardless of the radio configuration or MTU. No
download speed issue was reported for any of those cases. I believe
this particular download test suffers because the RTT is likely higher
on this network than in the other cases.
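As a back-of-envelope model only (my own approximation of the behaviour
after dfa2f0483360, not the kernel's actual tcp_win_from_space() code),
I've been reasoning about the window ceiling as rcvbuf scaled by the
observed payload/truesize ratio:

#include <linux/types.h>

/* Assumed approximation: the advertised window ceiling tracks rcvbuf
 * scaled by the measured skb->len/skb->truesize ratio.
 */
static u64 approx_rwin_ceiling(u64 rcvbuf, u64 payload, u64 truesize)
{
	/* e.g. rcvbuf = 8388608 (rb in the skmem output above) with a
	 * payload of ~1.3K in a ~2.5K truesize gives a ceiling well
	 * below the full rcvbuf.
	 */
	return rcvbuf * payload / truesize;
}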