Message-ID: <ad6e6f7c-ba17-70d-d8c3-10703813cdc@ewheeler.net>
Date: Thu, 26 Jun 2025 13:16:44 -0700 (PDT)
From: Eric Wheeler <netdev@...ts.ewheeler.net>
To: Neal Cardwell <ncardwell@...gle.com>
cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>,
Geumhwan Yu <geumhwan.yu@...sung.com>, Jakub Kicinski <kuba@...nel.org>,
Sasha Levin <sashal@...nel.org>, Yuchung Cheng <ycheng@...gle.com>,
stable@...nel.org
Subject: Re: [BISECT] regression: tcp: fix to allow timestamp undo if no
retransmits were sent
On Thu, 26 Jun 2025, Neal Cardwell wrote:
> On Wed, Jun 25, 2025 at 7:15 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> >
> > On Wed, 25 Jun 2025, Neal Cardwell wrote:
> > > On Wed, Jun 25, 2025 at 3:17 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > >
> > > > On Wed, 18 Jun 2025, Eric Wheeler wrote:
> > > > > On Mon, 16 Jun 2025, Neal Cardwell wrote:
> > > > > > On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > > > > > On Sun, 15 Jun 2025, Eric Wheeler wrote:
> > > > > > > > On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > > > > > > > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > > > > > > > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > > > > > > > > > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > > > > > > > > > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > > > > > >
> > > > > > > >
> > > > > > > > > The attached patch should apply (with "git am") for any recent kernel
> > > > > > > > > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > > > > > > > > sent" patch it is fixing. So you should be able to test it on top of
> > > > > > > > > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > > > > > > > > easier.
> > > > > > >
> > > > > > > Definitely better, but performance is ~15% slower versus reverting,
> > > > > > > and the retransmit counts are still higher than with the revert. In
> > > > > > > the two sections below you can see the difference between after the
> > > > > > > fix and after the revert.
> > > > > > >
> > > > > >
> > > > > > Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
> > > > > > cases multiple times, so we can get a sense of what is signal and what
> > > > > > is noise? Perhaps 20 or 50 trials for each approach?
> > > > >
> > > > > I ran 50 tests after the revert and compared them to the runs after
> > > > > the fix, using both the arithmetic and geometric mean, and the fix
> > > > > still appears to be slightly slower than the revert alone:
> > > > >
> > > > > # after-revert-6.6.93
> > > > > Arithmetic Mean: 843.64 Mbits/sec
> > > > > Geometric Mean: 841.95 Mbits/sec
> > > > >
> > > > > # after-tcp-fix-6.6.93
> > > > > Arithmetic Mean: 823.00 Mbits/sec
> > > > > Geometric Mean: 819.38 Mbits/sec
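> > > > >
> > > > > (For reference, a rough sketch of how those two summaries can be
> > > > > computed; it assumes a hypothetical file with one per-trial
> > > > > Mbits/sec value per line, not my actual script:)
> > > > >
> > > > > # arithmetic mean, plus geometric mean = exp(mean of the log values)
> > > > > awk '{ sum += $1; logsum += log($1); n++ }
> > > > >      END { printf "Arithmetic Mean: %.2f Mbits/sec\n", sum / n;
> > > > >            printf "Geometric Mean:  %.2f Mbits/sec\n", exp(logsum / n) }' throughputs.txt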
> > > > >
> > > >
> > > > Re-sending this question in case this message got lost:
> > > >
> > > > > Do you think that this is an actual performance regression, or just a
> > > > > sample set that is not big enough to work out the averages?
> > > > >
> > > > > Here is the data collected for each of the 50 tests:
> > > > > - https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
> > > > > - https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz
> > >
> > > Hi Eric,
> > >
> > > Many thanks for this great data!
> > >
> > > I have been looking at this data. It's quite interesting.
> > >
> > > Looking at the CDF of throughputs for the "revert" cases vs the "fix"
> > > cases (attached), it does look like, for the 70th percentile and below
> > > (the 70% most unlucky cases), the "fix" cases have lower throughput,
> > > and IMHO this looks outside the realm of what we would expect from
> > > noise.
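> > >
> > > (In case it is useful, the per-trial CDF can be eyeballed with
> > > something like the sketch below; the input file name is hypothetical,
> > > one throughput value per line:)
> > >
> > > # empirical CDF: print "percentile  value" pairs in ascending order
> > > sort -n fix-throughputs.txt | awk '{ v[NR] = $1 }
> > >     END { for (i = 1; i <= NR; i++)
> > >               printf "%3.0f%%  %s Mbits/sec\n", 100 * i / NR, v[i] }'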
> > >
> > > However, when I look at the traces, I don't see any reason why the
> > > "fix" cases would be systematically slower. In particular, the "fix"
> > > and "revert" cases are only changing a function used for "undo"
> > > decisions, but for both the "fix" and "revert" cases, there are no
> > > "undo" events, and I don't see cases with spurious retransmissions
> > > where there should have been "undo" events and yet there were not.
> > >
> > > Visually inspecting the traces, the dominant determinant of
> > > performance seems to be how many RTO events there were. For example,
> > > the worst case for the "fix" trials has 16 RTOs, whereas the worst
> > > case for the "revert" trials has 13 RTOs. And the number of RTO events
> > > per trial looks random; I see similar qualitative patterns between
> > > "fix" and "revert" cases, and don't see any reason why there are more
> > > RTOs in the "fix" cases than the "revert" cases. All the RTOs seem to
> > > be due to pre-existing (longstanding) performance problems in non-SACK
> > > loss recovery.
> > >
> > > One way to proceed would be for me to offer some performance fixes for
> > > the RTOs, so we can get rid of the RTOs, which are the biggest source
> > > of performance variation. That should greatly reduce noise, and
> > > perhaps make it easier to see if there is any real difference between
> > > "fix" and "revert" cases.
> > >
> > > We could compare the following two kernels, with another 50 tests for
> > > each of two kernels:
> > >
> > > + (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
> > > + (b) 6.6.93 + {2 patches to fix RTOs} + "fix"
> > >
> > > where:
> > >
> > > "revert" = revert e37ab7373696 ("tcp: fix to allow timestamp undo if
> > > no retransmits were sent")
> > > "fix" = apply d0fa59897e04 ("tcp: fix tcp_packet_delayed() for
> > > tcp_is_non_sack_preventing_reopen() behavior")
> > >
> > > This would have the side benefit of testing some performance
> > > improvements for non-SACK connections.
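> > >
> > > (Concretely, the two trees could be prepared roughly like this; the
> > > patch file names and branch names are placeholders, and on a 6.6.y
> > > tree the revert would target the stable backport e676ca60ad2a rather
> > > than the upstream commit ID:)
> > >
> > > # (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
> > > git checkout -b rto-fixes-revert v6.6.93
> > > git am rto-fix-1.patch rto-fix-2.patch     # the RTO patches, once posted
> > > git revert e676ca60ad2a
> > >
> > > # (b) 6.6.93 + {2 patches to fix RTOs} + "fix"
> > > git checkout -b rto-fixes-fix v6.6.93
> > > git am rto-fix-1.patch rto-fix-2.patch
> > > git am tcp_packet_delayed-fix.patch        # d0fa59897e04, sent earlier in this thread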
> > >
> > > Are you up for that? :-)
> >
> >
> > Sure, if you have some patch ideas in mind, I'm all for getting patches
> > merged to improve performance.
>
> Great! Thanks for being willing to do this! I will try to post some
> patches ASAP.
>
> > BTW, what causes a non-SACK connection? The RX side is a near-idle Linux
> > 6.8 host with default sysctl settings.
>
> Given the RX side is a Linux 6.8 host, the kernel should support TCP
> SACK due to the kernel compile-time default (see the
> "net->ipv4.sysctl_tcp_sack = 1;" in net/ipv4/tcp_ipv4.c).
>
> Given that factor, off-hand, I can think of only a few reasons why the
> RX side would not negotiate SACK support:
>
> (1) Some script or software on the RX machine has disabled SACK via
> "sysctl net.ipv4.tcp_sack=0" or equivalent, perhaps at boot time (this
> is easy to check with "sysctl net.ipv4.tcp_sack").

It looks like you are right:
# cat /proc/sys/net/ipv4/tcp_sack
0
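
For reference, re-enabling it is just the usual sysctl (the sysctl.d
file name below is only illustrative, for making it persistent):

  # enable SACK at runtime
  sysctl -w net.ipv4.tcp_sack=1

  # keep it enabled across reboots
  echo 'net.ipv4.tcp_sack = 1' > /etc/sysctl.d/90-tcp-sack.conf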
and it runs way faster after turning it on:
~]# iperf3 -c 192.168.1.203
Connecting to host 192.168.1.203, port 5201
[  5] local 192.168.1.52 port 55104 connected to 192.168.1.203 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   115 MBytes   964 Mbits/sec   27    234 KBytes
[  5]   1.00-2.00   sec   113 MBytes   949 Mbits/sec    7    242 KBytes
[  5]   2.00-3.00   sec   113 MBytes   950 Mbits/sec    7    247 KBytes
[  5]   3.00-4.00   sec   113 MBytes   947 Mbits/sec    8    261 KBytes
[  5]   4.00-5.00   sec   114 MBytes   953 Mbits/sec   11    261 KBytes
[  5]   5.00-6.00   sec   113 MBytes   948 Mbits/sec    9    261 KBytes
[  5]   6.00-7.00   sec   113 MBytes   950 Mbits/sec    5    261 KBytes
[  5]   7.00-8.00   sec   113 MBytes   947 Mbits/sec   10    272 KBytes
[  5]   8.00-9.00   sec   113 MBytes   950 Mbits/sec    5    274 KBytes
[  5]   9.00-10.00  sec   113 MBytes   947 Mbits/sec    6    275 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.11 GBytes   950 Mbits/sec   95             sender
[  5]   0.00-10.04  sec  1.11 GBytes   945 Mbits/sec                  receiver

Do you want to continue troubleshooting non-SACK performance, since I
have a reliable way to reproduce the issue, or leave it here with "I
should have had SACK enabled"?
-Eric