Message-ID: <CADVnQymM2HrGrMGyJX2QQ9PpgQT8JqsRz_0U8_WvdvzteqsfEQ@mail.gmail.com>
Date: Sat, 14 Oct 2023 15:40:33 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Stefan Wahren <wahrenst@....net>
Cc: Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
Fabio Estevam <festevam@...il.com>, linux-imx@....com,
Stefan Wahren <stefan.wahren@...rgebyte.com>, Michael Heimpold <mhei@...mpold.de>, netdev@...r.kernel.org,
Yuchung Cheng <ycheng@...gle.com>
Subject: Re: iperf performance regression since Linux 5.18
On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@....net> wrote:
>
> Hi,
>
> Am 09.10.23 um 21:19 schrieb Neal Cardwell:
> > On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@....net> wrote:
> >>> Hi,
> >>> we recently switched our ARM NXP i.MX6ULL based embedded device
> >>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> >>> that we noticed a measurable performance regression on the Ethernet
> >>> interface (driver: fec, 100 Mbit link) while running an iperf client on
> >>> the device:
> >>>
> >>> BAD
> >>>
> >>> # iperf -t 10 -i 1 -c 192.168.1.129
> >>> ------------------------------------------------------------
> >>> Client connecting to 192.168.1.129, TCP port 5001
> >>> TCP window size: 96.2 KByte (default)
> >>> ------------------------------------------------------------
> >>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> >>> [ ID] Interval Transfer Bandwidth
> >>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> >>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> >>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> >>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> >>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> >>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
> >>>
> >>> GOOD
> >>>
> >>> # iperf -t 10 -i 1 -c 192.168.1.129
> >>> ------------------------------------------------------------
> >>> Client connecting to 192.168.1.129, TCP port 5001
> >>> TCP window size: 96.2 KByte (default)
> >>> ------------------------------------------------------------
> >>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> >>> [ ID] Interval Transfer Bandwidth
> >>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> >>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> >>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> >>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> >>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> >>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
> >>>
> >>> We were able to bisect this down to this commit:
> >>>
> >>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> >>> TSO packet sizes based on min_rtt
> >>>
> >>> Disabling this new setting via:
> >>>
> >>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
> >>>
> >>> confirmed that this was the cause of the performance regression.
> >>>
> >>> Is it expected that the new default setting has such a performance impact?
> > Indeed, thanks for the report.
> >
> > In addition to the "ss" output Eric mentioned, could you please grab
> > "nstat" output, which should allow us to calculate the average TSO/GSO
> > and LRO/GRO burst sizes, which is the key thing tuned with the
> > tcp_tso_rtt_log knob.
> >
> > So it would be great to have the following from both data sender and
> > data receiver, for both the good case and bad case, if you could start
> > these before your test and kill them after the test stops:
> >
> > (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
> > nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
> I uploaded everything here:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress
>
> The server side is an Ubuntu installation connected to the internet. At
> first I logged the good case, then I continued with the bad case. I
> accidentally deleted a log file of the bad case, so I repeated the whole
> bad case. So the uploaded bad-case files are from the third run.
Thanks for the detailed data!
Here are some notes from looking at this data:
+ bad client: avg TSO burst size is roughly:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
IpOutRequests 308 44.7
IpExtOutOctets 10050656 1403181.0
est bytes per TSO burst: 10050656 / 308 = 32632
est packets per TSO burst: 32632 / 1448 ~= 22.5
+ good client: avg TSO burst size is roughly:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
IpOutRequests 529 62.0
IpExtOutOctets 11502992 1288711.5
est bytes per TSO burst: 11502992 / 529 ~= 21745
est packets per TSO burst: 21745 / 1448 ~= 15.0
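(For reference, the same arithmetic as a couple of quick awk one-liners;
the counter deltas are copied from the nstat samples above, and a
1448-byte MSS is assumed:)
awk 'BEGIN { oct=10050656; pkts=308; mss=1448;   # bad-case deltas
             printf "bytes/burst: %.0f  pkts/burst: %.1f\n", oct/pkts, oct/pkts/mss }'
awk 'BEGIN { oct=11502992; pkts=529; mss=1448;   # good-case deltas
             printf "bytes/burst: %.0f  pkts/burst: %.1f\n", oct/pkts, oct/pkts/mss }'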
+ bad client ss data:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
State Recv-Q Send-Q Local Address:Port  Peer Address:Port  Process
ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
reordering:0
Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
+ good client ss data:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
Fri Oct 13 15:04:36 CEST 2023
State Recv-Q Send-Q Local Address:Port  Peer Address:Port  Process
ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
reordering:0
Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
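(Same formula as a one-liner, i.e. cwnd * MSS * 8 / RTT; the 1.8 ms and
2.8 ms RTT values are the ones used in the divisions above:)
awk 'BEGIN { mss=1448;
             printf "bad:  %.1f Mbit/s\n",  68 * mss * 8 / 0.0018 / 1e6;
             printf "good: %.1f Mbit/s\n", 106 * mss * 8 / 0.0028 / 1e6 }'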
So it does indeed seem that cwnd is not the limiting factor, and instead
there is something about the larger TSO/GSO bursts (roughly 22.5
packets per burst on average) in the "bad" case that is causing
problems, and preventing the sender from keeping the pipe fully
utilized.
So perhaps the details of the tcp_tso_should_defer() logic are hurting
performance?
The default value of tcp_tso_win_divisor is 3, and in the bad case the
cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
suspiciously close to the average TSO burst size of 22.5. So my guess
is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
perhaps if we raise it to 5, then 68/5 = 13.6 will approximate the
TSO burst size in the "good" case, and fully utilize the pipe. So it
seems worth an experiment, to see what we can learn.
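(To make that guess concrete, here is the burst cap implied by
cwnd / tcp_tso_win_divisor for cwnd = 68 at a few divisor values; this is
just the arithmetic behind the guess, not the exact tcp_tso_should_defer()
code:)
awk 'BEGIN { cwnd=68; for (d = 3; d <= 5; d++)
             printf "tcp_tso_win_divisor=%d -> cwnd/divisor ~= %.1f packets\n", d, cwnd/d }'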
To test that theory, could you please try running the following as
root on the data sender machine, and then re-running the "bad" test
with tcp_tso_rtt_log at the default value of 9?
sysctl net.ipv4.tcp_tso_win_divisor=5
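(If you want to double-check the knobs right before the run, something
like this should show the expected values:)
sysctl net.ipv4.tcp_tso_rtt_log      # should still report the default of 9
sysctl net.ipv4.tcp_tso_win_divisor  # should report 5 after the change above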
Thanks!
neal