Message-ID: <BF6B00CC65FD2D45A326E74492B2C19FB77853E9@FR711WXCHMBA05.zeu.alcatel-lucent.com>
Date: Thu, 3 Nov 2016 16:37:48 +0000
From: "De Schepper, Koen (Nokia - BE)"
<koen.de_schepper@...ia-bell-labs.com>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Is there a maximum bytes in flight limitation in the tcp stack?
Hi,
We are hitting a limit on the maximum number of packets in flight that does not seem to be related to the receive or write buffers. Does somebody know of an issue that would cap a single TCP flow at around 1 MByte (or sometimes 2 MByte) of data in flight?
It appears to be a strict and stable limit, independent of the congestion control (tested with CUBIC, Reno and DCTCP). On a 200 Mbps link with 200 ms RTT, a single TCP flow only reaches 20% utilization (sometimes 40%, see the conditions below) without experiencing a single drop; the AQM and RTT emulation are not the bottleneck, as the link carries more throughput when multiple flows are active.
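For reference, the back-of-the-envelope numbers behind the 20%/40% figures; a minimal Python sketch (the 1 MiB / 2 MiB in-flight values are simply the limits we observe below):

# Bandwidth-delay product of the test link and the utilization that
# results if a flow is capped at ~1 MiB or ~2 MiB in flight.
link_rate_bps = 200e6        # 200 Mbps
rtt_s = 0.200                # 200 ms
bdp_bytes = link_rate_bps / 8 * rtt_s   # ~5 MB needed to fill the pipe

for in_flight in (1 * 2**20, 2 * 2**20):
    print("%.0f MiB in flight -> %.0f%% utilization"
          % (in_flight / 2**20, 100.0 * in_flight / bdp_bytes))
# ~1 MiB -> ~21%, ~2 MiB -> ~42%, matching the 20%/40% we measure.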
Some configuration changes we already tried on both client and server (kernel 3.18.9):
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304
SERVER# ss -i
tcp ESTAB 0 1049728 10.187.255.211:46642 10.187.16.194:ssh
dctcp wscale:7,7 rto:408 rtt:204.333/0.741 ato:40 mss:1448 cwnd:1466 send 83.1Mbps unacked:728 rcv_rtt:212 rcv_space:29200
CLIENT# ss -i
tcp ESTAB 0 288 10.187.16.194:ssh 10.187.255.211:46642
dctcp wscale:7,7 rto:404 rtt:203.389/0.213 ato:40 mss:1448 cwnd:78 send 4.4Mbps unacked:8 rcv_rtt:204 rcv_space:1074844
When we increase the write and receive memory further (they were already well above 1 or 2 MB), the limit steps up to double (40% utilization; 2 MBytes in flight):
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_rmem = 4096 8000000 16291456
net.ipv4.tcp_wmem = 4096 8000000 16291456
SERVER # ss -i
tcp ESTAB 0 2068976 10.187.255.212:54637 10.187.16.112:ssh
cubic wscale:8,8 rto:404 rtt:202.622/0.061 ato:40 mss:1448 cwnd:1849 ssthresh:1140 send 105.7Mbps unacked:1457 rcv_rtt:217.5 rcv_space:29200
CLIENT# ss -i
tcp ESTAB 0 648 10.187.16.112:ssh 10.187.255.212:54637
cubic wscale:8,8 rto:404 rtt:201.956/0.038 ato:40 mss:1448 cwnd:132 send 7.6Mbps unacked:18 rcv_rtt:204 rcv_space:2093044
Increasing them by another factor of 10 does not help any more:
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_rmem = 4096 80000000 162914560
net.ipv4.tcp_wmem = 4096 80000000 162914560
As all these parameters autotune, it is hard to find out which one is limiting. In the examples above, unacked refuses to go higher even though the congestion window on the server is big enough. rcv_space could be the limit, but it tunes up when I configure the server with the larger buffers (which switches the flow to 2 MByte in flight).
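To make that concrete, here is the same comparison done numerically on the first server snapshot above (values copied from the ss output; just an illustration of why cwnd does not look like the limiter):

# Numbers from the first "ss -i" snapshot on the server (DCTCP case).
mss = 1448
cwnd_packets = 1466
unacked_packets = 728

cwnd_bytes = cwnd_packets * mss          # ~2.12 MB allowed by cwnd
in_flight_bytes = unacked_packets * mss  # ~1.05 MB actually unacked

print("cwnd allows  %.2f MB" % (cwnd_bytes / 1e6))
print("in flight is %.2f MB" % (in_flight_bytes / 1e6))
# cwnd would allow roughly twice what is actually in flight, so something
# other than the congestion window seems to cap the flow at ~1 MB.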
We also tried tcp_limit_output_bytes, setting it both larger (x10) and smaller (/10), without any effect. We put it in /etc/sysctl.conf and rebooted to make sure it was applied.
Some more detailed tests that had an effect on whether the limit is 1 or 2 MByte:
- With TSO off, configuring a bigger wmem buffer immediately lets an ongoing flow double its bytes-in-flight limit. Increasing the buffer further, up to more than 10x, does not help any more, and the only limits we ever see are 1 MByte and 2 MByte (no intermediate values, whatever parameter we change). When we set tcp_wmem smaller again, the 2 MByte limit stays in effect for the ongoing flow; we have to restart the flow before the reduction back to 1 MByte takes effect.
- With TSO on, only the 2 MByte limit applies, independent of the wmem buffer. We have to restart the flow to make a TSO change take effect (a sketch of how we track the in-flight bytes during these tests follows below).
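For completeness, a minimal sketch of how the in-flight bytes can be tracked while toggling TSO or tcp_wmem; it just multiplies the mss and unacked fields of ss -tin (the same fields as in the snapshots above), so treat it as an illustration rather than a measurement tool:

# Poll "ss -tin" once per second and print unacked * mss per connection.
import re
import subprocess
import time

def in_flight_bytes():
    out = subprocess.run(["ss", "-tin"], capture_output=True, text=True).stdout
    return [int(mss) * int(unacked)
            for mss, unacked in re.findall(r"mss:(\d+).*?unacked:(\d+)", out)]

while True:
    print(in_flight_bytes())
    time.sleep(1)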
Koen.