Message-ID: <ad6e6f7c-ba17-70d-d8c3-10703813cdc@ewheeler.net>
Date: Thu, 26 Jun 2025 13:16:44 -0700 (PDT)
From: Eric Wheeler <netdev@...ts.ewheeler.net>
To: Neal Cardwell <ncardwell@...gle.com>
cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>,
Geumhwan Yu <geumhwan.yu@...sung.com>, Jakub Kicinski <kuba@...nel.org>,
Sasha Levin <sashal@...nel.org>, Yuchung Cheng <ycheng@...gle.com>,
stable@...nel.org
Subject: Re: [BISECT] regression: tcp: fix to allow timestamp undo if no
retransmits were sent
On Thu, 26 Jun 2025, Neal Cardwell wrote:
> On Wed, Jun 25, 2025 at 7:15 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> >
> > On Wed, 25 Jun 2025, Neal Cardwell wrote:
> > > On Wed, Jun 25, 2025 at 3:17 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > >
> > > > On Wed, 18 Jun 2025, Eric Wheeler wrote:
> > > > > On Mon, 16 Jun 2025, Neal Cardwell wrote:
> > > > > > On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > > > > > On Sun, 15 Jun 2025, Eric Wheeler wrote:
> > > > > > > > On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > > > > > > > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > > > > > > > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > > > > > > > > > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > > > > > > > > > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > > > > > >
> > > > > > > >
> > > > > > > > > The attached patch should apply (with "git am") for any recent kernel
> > > > > > > > > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > > > > > > > > sent" patch it is fixing. So you should be able to test it on top of
> > > > > > > > > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > > > > > > > > easier.
> > > > > > >
> > > > > > > Definitely better, but performance is ~15% slower versus reverting,
> > > > > > > and the retransmit counts are still higher than with the revert. In
> > > > > > > the two sections below you can see the difference between after the
> > > > > > > fix and after the revert.
> > > > > > >
> > > > > >
> > > > > > Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
> > > > > > cases multiple times, so we can get a sense of what is signal and what
> > > > > > is noise? Perhaps 20 or 50 trials for each approach?
> > > > >
> > > > > I ran 50 tests after the revert and compared them to the runs after
> > > > > the fix, using both the arithmetic and geometric mean, and the fix
> > > > > still appears to be slightly slower than the revert alone:
> > > > >
> > > > > # after-revert-6.6.93
> > > > > Arithmetic Mean: 843.64 Mbits/sec
> > > > > Geometric Mean: 841.95 Mbits/sec
> > > > >
> > > > > # after-tcp-fix-6.6.93
> > > > > Arithmetic Mean: 823.00 Mbits/sec
> > > > > Geometric Mean: 819.38 Mbits/sec
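> > > > >
> > > > > (For reference, a rough sketch of how those two summaries can be
> > > > > computed; it assumes a hypothetical file with one per-trial
> > > > > Mbits/sec value per line, not my actual script:)
> > > > >
> > > > > # arithmetic mean, plus geometric mean = exp(mean of the log values)
> > > > > awk '{ sum += $1; logsum += log($1); n++ }
> > > > >      END { printf "Arithmetic Mean: %.2f Mbits/sec\n", sum / n;
> > > > >            printf "Geometric Mean:  %.2f Mbits/sec\n", exp(logsum / n) }' throughputs.txt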
> > > > >
> > > >
> > > > Re-sending this question in case this message got lost:
> > > >
> > > > > Do you think that this is an actual performance regression, or just a
> > > > > sample set that is not big enough to work out the averages?
> > > > >
> > > > > Here is the data collected for each of the 50 tests:
> > > > > - https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
> > > > > - https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz
> > >
> > > Hi Eric,
> > >
> > > Many thanks for this great data!
> > >
> > > I have been looking at this data. It's quite interesting.
> > >
> > > Looking at the CDF of throughputs for the "revert" cases vs the "fix"
> > > cases (attached), it does look like, for the 70th percentile and below
> > > (the 70% most unlucky cases), the "fix" cases have lower throughput,
> > > and IMHO this looks outside the realm of what we would expect from
> > > noise.
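> > >
> > > (In case it is useful, the per-trial CDF can be eyeballed with
> > > something like the sketch below; the input file name is hypothetical,
> > > one throughput value per line:)
> > >
> > > # empirical CDF: print "percentile  value" pairs in ascending order
> > > sort -n fix-throughputs.txt | awk '{ v[NR] = $1 }
> > >     END { for (i = 1; i <= NR; i++)
> > >               printf "%3.0f%%  %s Mbits/sec\n", 100 * i / NR, v[i] }'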
> > >
> > > However, when I look at the traces, I don't see any reason why the
> > > "fix" cases would be systematically slower. In particular, the "fix"
> > > and "revert" cases are only changing a function used for "undo"
> > > decisions, but for both the "fix" and "revert" cases, there are no
> > > "undo" events, and I don't see cases with spurious retransmissions
> > > where there should have been "undo" events and yet there were not.
> > >
> > > Visually inspecting the traces, the dominant determinant of
> > > performance seems to be how many RTO events there were. For example,
> > > the worst case for the "fix" trials has 16 RTOs, whereas the worst
> > > case for the "revert" trials has 13 RTOs. And the number of RTO events
> > > per trial looks random; I see similar qualitative patterns between
> > > "fix" and "revert" cases, and don't see any reason why there are more
> > > RTOs in the "fix" cases than the "revert" cases. All the RTOs seem to
> > > be due to pre-existing (longstanding) performance problems in non-SACK
> > > loss recovery.
> > >
> > > One way to proceed would be for me to offer some performance fixes for
> > > the RTOs, so we can get rid of the RTOs, which are the biggest source
> > > of performance variation. That should greatly reduce noise, and
> > > perhaps make it easier to see if there is any real difference between
> > > "fix" and "revert" cases.
> > >
> > > We could compare the following two kernels, with another 50 tests for
> > > each of two kernels:
> > >
> > > + (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
> > > + (b) 6.6.93 + {2 patches to fix RTOs} + "fix"
> > >
> > > where:
> > >
> > > "revert" = revert e37ab7373696 ("tcp: fix to allow timestamp undo if
> > > no retransmits were sent")
> > > "fix" = apply d0fa59897e04 ("tcp: fix tcp_packet_delayed() for
> > > tcp_is_non_sack_preventing_reopen() behavior")
> > >
> > > This would have the side benefit of testing some performance
> > > improvements for non-SACK connections.
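> > >
> > > (Concretely, the two trees could be prepared roughly like this; the
> > > patch file names and branch names are placeholders, and on a 6.6.y
> > > tree the revert would target the stable backport e676ca60ad2a rather
> > > than the upstream commit ID:)
> > >
> > > # (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
> > > git checkout -b rto-fixes-revert v6.6.93
> > > git am rto-fix-1.patch rto-fix-2.patch     # the RTO patches, once posted
> > > git revert e676ca60ad2a
> > >
> > > # (b) 6.6.93 + {2 patches to fix RTOs} + "fix"
> > > git checkout -b rto-fixes-fix v6.6.93
> > > git am rto-fix-1.patch rto-fix-2.patch
> > > git am tcp_packet_delayed-fix.patch        # d0fa59897e04, sent earlier in this thread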
> > >
> > > Are you up for that? :-)
> >
> >
> > Sure, if you have some patch ideas in mind, I'm all for getting patches
> > merged to improve performance.
>
> Great! Thanks for being willing to do this! I will try to post some
> patches ASAP.
>
> > BTW, what causes a non-SACK connection? The RX side is a near-idle Linux
> > 6.8 host with default sysctl settings.
>
> Given the RX side is a Linux 6.8 host, the kernel should support TCP
> SACK due to the kernel compile-time default (see the
> "net->ipv4.sysctl_tcp_sack = 1;" in net/ipv4/tcp_ipv4.c).
>
> Given that factor, off-hand, I can think of only a few reasons why the
> RX side would not negotiate SACK support:
>
> (1) Some script or software on the RX machine has disabled SACK via
> "sysctl net.ipv4.tcp_sack=0" or equivalent, perhaps at boot time (this
> is easy to check with "sysctl net.ipv4.tcp_sack").

It looks like you are right:
# cat /proc/sys/net/ipv4/tcp_sack
0
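
For reference, re-enabling it is just the usual sysctl (the sysctl.d
file name below is only illustrative, for making it persistent):

  # enable SACK at runtime
  sysctl -w net.ipv4.tcp_sack=1

  # keep it enabled across reboots
  echo 'net.ipv4.tcp_sack = 1' > /etc/sysctl.d/90-tcp-sack.conf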
and it runs way faster after turning it on:
~]# iperf3 -c 192.168.1.203
Connecting to host 192.168.1.203, port 5201
[  5] local 192.168.1.52 port 55104 connected to 192.168.1.203 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   115 MBytes   964 Mbits/sec   27    234 KBytes
[  5]   1.00-2.00   sec   113 MBytes   949 Mbits/sec    7    242 KBytes
[  5]   2.00-3.00   sec   113 MBytes   950 Mbits/sec    7    247 KBytes
[  5]   3.00-4.00   sec   113 MBytes   947 Mbits/sec    8    261 KBytes
[  5]   4.00-5.00   sec   114 MBytes   953 Mbits/sec   11    261 KBytes
[  5]   5.00-6.00   sec   113 MBytes   948 Mbits/sec    9    261 KBytes
[  5]   6.00-7.00   sec   113 MBytes   950 Mbits/sec    5    261 KBytes
[  5]   7.00-8.00   sec   113 MBytes   947 Mbits/sec   10    272 KBytes
[  5]   8.00-9.00   sec   113 MBytes   950 Mbits/sec    5    274 KBytes
[  5]   9.00-10.00  sec   113 MBytes   947 Mbits/sec    6    275 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.11 GBytes   950 Mbits/sec   95             sender
[  5]   0.00-10.04  sec  1.11 GBytes   945 Mbits/sec                  receiver

Do you want to continue troubleshooting non-SACK performance, since I
have a reliable way to reproduce the issue, or leave it here with "I
should have had SACK enabled"?
-Eric