Message-ID: <CADVnQymM2HrGrMGyJX2QQ9PpgQT8JqsRz_0U8_WvdvzteqsfEQ@mail.gmail.com>
Date: Sat, 14 Oct 2023 15:40:33 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Stefan Wahren <wahrenst@....net>
Cc: Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
Fabio Estevam <festevam@...il.com>, linux-imx@....com,
Stefan Wahren <stefan.wahren@...rgebyte.com>, Michael Heimpold <mhei@...mpold.de>, netdev@...r.kernel.org,
Yuchung Cheng <ycheng@...gle.com>
Subject: Re: iperf performance regression since Linux 5.18
On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@....net> wrote:
>
> Hi,
>
> Am 09.10.23 um 21:19 schrieb Neal Cardwell:
> > On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@....net> wrote:
> >>> Hi,
> >>> we recently switched our ARM NXP i.MX6ULL based embedded device
> >>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> >>> that we noticed a measurable performance regression on the Ethernet
> >>> interface (driver: fec, 100 Mbit link) while running an iperf client on
> >>> the device:
> >>>
> >>> BAD
> >>>
> >>> # iperf -t 10 -i 1 -c 192.168.1.129
> >>> ------------------------------------------------------------
> >>> Client connecting to 192.168.1.129, TCP port 5001
> >>> TCP window size: 96.2 KByte (default)
> >>> ------------------------------------------------------------
> >>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> >>> [ ID] Interval Transfer Bandwidth
> >>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> >>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> >>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> >>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> >>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> >>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
> >>>
> >>> GOOD
> >>>
> >>> # iperf -t 10 -i 1 -c 192.168.1.129
> >>> ------------------------------------------------------------
> >>> Client connecting to 192.168.1.129, TCP port 5001
> >>> TCP window size: 96.2 KByte (default)
> >>> ------------------------------------------------------------
> >>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> >>> [ ID] Interval Transfer Bandwidth
> >>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> >>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> >>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> >>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> >>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> >>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
> >>>
> >>> We were able to bisect this down to this commit:
> >>>
> >>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> >>> TSO packet sizes based on min_rtt
> >>>
> >>> Disabling this new setting via:
> >>>
> >>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
> >>>
> >>> confirmed that this was the cause of the performance regression.
> >>>
> >>> Is it expected that the new default setting has such a performance impact?
> > Indeed, thanks for the report.
> >
> > In addition to the "ss" output Eric mentioned, could you please grab
> > "nstat" output, which should allow us to calculate the average TSO/GSO
> > and LRO/GRO burst sizes, which is the key thing tuned with the
> > tcp_tso_rtt_log knob.
> >
> > So it would be great to have the following from both data sender and
> > data receiver, for both the good case and bad case, if you could start
> > these before your test and kill them after the test stops:
> >
> > (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
> > nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
> I uploaded everything here:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress
>
> The server side is an Ubuntu installation connected to the internet. At
> first I logged the good case, then I continued with the bad case. I
> accidentally deleted a log file of the bad case, so I repeated the whole
> bad case. So the uploaded bad-case files are from the third run.
Thanks for the detailed data!
Here are some notes from looking at this data:
+ bad client: avg TSO burst size is roughly:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
IpOutRequests 308 44.7
IpExtOutOctets 10050656 1403181.0
est bytes per TSO burst: 10050656 / 308 = 32632
est packets per TSO burst: 32632 / 1448 ~= 22.5
+ good client: avg TSO burst size is roughly:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
IpOutRequests 529 62.0
IpExtOutOctets 11502992 1288711.5
est bytes per TSO burst: 11502992 / 529 ~= 21745
est packets per TSO burst: 21745 / 1448 ~= 15.0
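(For reference, the same arithmetic as a couple of quick awk one-liners;
the counter deltas are copied from the nstat samples above, and a
1448-byte MSS is assumed:)
awk 'BEGIN { oct=10050656; pkts=308; mss=1448;   # bad-case deltas
             printf "bytes/burst: %.0f  pkts/burst: %.1f\n", oct/pkts, oct/pkts/mss }'
awk 'BEGIN { oct=11502992; pkts=529; mss=1448;   # good-case deltas
             printf "bytes/burst: %.0f  pkts/burst: %.1f\n", oct/pkts, oct/pkts/mss }'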
+ bad client ss data:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
State Recv-Q Send-Q Local Address:Port  Peer Address:Port  Process
ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
reordering:0
Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
+ good client ss data:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
Fri Oct 13 15:04:36 CEST 2023
State Recv-Q Send-Q Local Address:Port  Peer Address:Port  Process
ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
reordering:0
Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
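(Same formula as a one-liner, i.e. cwnd * MSS * 8 / RTT; the 1.8 ms and
2.8 ms RTT values are the ones used in the divisions above:)
awk 'BEGIN { mss=1448;
             printf "bad:  %.1f Mbit/s\n",  68 * mss * 8 / 0.0018 / 1e6;
             printf "good: %.1f Mbit/s\n", 106 * mss * 8 / 0.0028 / 1e6 }'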
So it does indeed seem that cwnd is not the limiting factor, and instead
there is something about the larger TSO/GSO bursts (roughly 22.5
packets per burst on average) in the "bad" case that is causing
problems, and preventing the sender from keeping the pipe fully
utilized.
So perhaps the details of the tcp_tso_should_defer() logic are hurting
performance?
The default value of tcp_tso_win_divisor is 3, and in the bad case the
cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
suspiciously close to the average TSO burst size of 22.5. So my guess
is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
perhaps if we raise it to 5, then 68/5 = 13.6 will approximate the
TSO burst size in the "good" case, and fully utilize the pipe. So it
seems worth an experiment, to see what we can learn.
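(To make that guess concrete, here is the burst cap implied by
cwnd / tcp_tso_win_divisor for cwnd = 68 at a few divisor values; this is
just the arithmetic behind the guess, not the exact tcp_tso_should_defer()
code:)
awk 'BEGIN { cwnd=68; for (d = 3; d <= 5; d++)
             printf "tcp_tso_win_divisor=%d -> cwnd/divisor ~= %.1f packets\n", d, cwnd/d }'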
To test that theory, could you please try running the following as
root on the data sender machine, and then re-running the "bad" test
with tcp_tso_rtt_log at the default value of 9?
sysctl net.ipv4.tcp_tso_win_divisor=5
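(If you want to double-check the knobs right before the run, something
like this should show the expected values:)
sysctl net.ipv4.tcp_tso_rtt_log      # should still report the default of 9
sysctl net.ipv4.tcp_tso_win_divisor  # should report 5 after the change above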
Thanks!
neal