Message-ID: <CADVnQym2vbWXHSVhyc6-QZLg-ASJfM-SCzu1tRsyapD_ms9S_w@mail.gmail.com>
Date: Thu, 26 Jun 2025 10:21:59 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Eric Wheeler <netdev@...ts.ewheeler.net>
Cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>,
Geumhwan Yu <geumhwan.yu@...sung.com>, Jakub Kicinski <kuba@...nel.org>,
Sasha Levin <sashal@...nel.org>, Yuchung Cheng <ycheng@...gle.com>, stable@...nel.org
Subject: Re: [BISECT] regression: tcp: fix to allow timestamp undo if no
retransmits were sent

On Wed, Jun 25, 2025 at 7:15 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
>
> On Wed, 25 Jun 2025, Neal Cardwell wrote:
> > On Wed, Jun 25, 2025 at 3:17 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > >
> > > On Wed, 18 Jun 2025, Eric Wheeler wrote:
> > > > On Mon, 16 Jun 2025, Neal Cardwell wrote:
> > > > > On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > > > > On Sun, 15 Jun 2025, Eric Wheeler wrote:
> > > > > > > On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > > > > > > > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> > > > > > > > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > > > > > > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > > > > > > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > > > > > > > > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > > > > > > > > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > > > > >
> > > > > > >
> > > > > > > > The attached patch should apply (with "git am") for any recent kernel
> > > > > > > > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > > > > > > > sent" patch it is fixing. So you should be able to test it on top of
> > > > > > > > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > > > > > > > easier.
> > > > > >
> > > > > > Definitely better, but performance is ~15% slower vs. reverting, and
> > > > > > the retransmit counts are still higher than with the revert. In the
> > > > > > two sections below you can see the difference between after the fix
> > > > > > and after the revert.
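
(For reference: assuming these runs were captured with iperf3's JSON
output, which is an assumption since the tooling isn't named here, a
minimal Python sketch to pull the per-trial throughput and retransmit
counts for comparison could look like the following; the trial-*.json
naming is hypothetical.)

import glob
import json

# Hypothetical layout: one "iperf3 -J" output file per trial.
for path in sorted(glob.glob("trial-*.json")):
    with open(path) as f:
        run = json.load(f)
    # iperf3 reports sender-side TCP retransmits under end.sum_sent.
    sent = run["end"]["sum_sent"]
    mbps = sent["bits_per_second"] / 1e6
    print(f"{path}: {mbps:8.1f} Mbit/s, {sent['retransmits']} retransmits")
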
> > > > > >
> > > > >
> > > > > Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
> > > > > cases multiple times, so we can get a sense of what is signal and what
> > > > > is noise? Perhaps 20 or 50 trials for each approach?
> > > >
> > > > I ran 50 tests after the revert and compared them to tests after the
> > > > fix, using both the arithmetic and geometric mean, and it still
> > > > appears to be slightly slower than with the revert alone:
> > > >
> > > > # after-revert-6.6.93
> > > > Arithmetic Mean: 843.64 Mbits/sec
> > > > Geometric Mean: 841.95 Mbits/sec
> > > >
> > > > # after-tcp-fix-6.6.93
> > > > Arithmetic Mean: 823.00 Mbits/sec
> > > > Geometric Mean: 819.38 Mbits/sec
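
(Side note: both summary statistics are easy to recompute from the raw
per-trial numbers. A minimal, untested Python sketch, with placeholder
values standing in for the 50 measured throughputs:)

import statistics

# Placeholder values; in practice this would hold all 50 trial results.
rates_mbps = [843.2, 812.7, 851.0, 823.9]

arith = statistics.mean(rates_mbps)
# Geometric mean = exp(mean(log(x))); it is less sensitive to occasional
# high outliers than the arithmetic mean.
geo = statistics.geometric_mean(rates_mbps)
print(f"Arithmetic Mean: {arith:.2f} Mbits/sec")
print(f"Geometric Mean:  {geo:.2f} Mbits/sec")
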
> > > >
> > >
> > > Re-sending this question in case this message got lost:
> > >
> > > > Do you think this is an actual performance regression, or is the
> > > > sample set just not big enough to average out the noise?
> > > >
> > > > Here is the data collected for each of the 50 tests:
> > > > - https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
> > > > - https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz
> >
> > Hi Eric,
> >
> > Many thanks for this great data!
> >
> > I have been looking at this data. It's quite interesting.
> >
> > Looking at the CDF of throughputs for the "revert" cases vs the "fix"
> > cases (attached), it does look like at the 70th percentile and below
> > (the 70% of most unlucky cases) the "fix" cases have lower throughput,
> > and IMHO this looks outside the realm of what we would expect from
> > noise.
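
(Expanding on this: the empirical CDF here is just each sample set
sorted and plotted against rank. A rough, untested Python sketch of the
percentile comparison, with placeholder data standing in for the 50
trials per kernel:)

import statistics

revert = [843.2, 820.5, 851.0, 799.8]  # placeholder; 50 trials in practice
fix = [823.0, 801.4, 840.2, 780.9]     # placeholder; 50 trials in practice

# Deciles (10th, 20th, ..., 90th percentiles) of each distribution;
# comparing them row by row approximates comparing the two CDFs.
for p, r, f in zip(range(10, 100, 10),
                   statistics.quantiles(revert, n=10),
                   statistics.quantiles(fix, n=10)):
    print(f"p{p}: revert={r:7.1f}  fix={f:7.1f}  Mbit/s")
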
> >
> > However, when I look at the traces, I don't see any reason why the
> > "fix" cases would be systematically slower. In particular, the "fix"
> > and "revert" cases are only changing a function used for "undo"
> > decisions, but in both the "fix" and "revert" cases, there are no
> > "undo" events, and I don't see cases with spurious retransmissions
> > where there should have been "undo" events and yet there were not.
> >
> > Visually inspecting the traces, the dominant determinant of
> > performance seems to be how many RTO events there were. For example,
> > the worst case for the "fix" trials has 16 RTOs, whereas the worst
> > case for the "revert" trials has 13 RTOs. And the number of RTO events
> > per trial looks random; I see similar qualitative patterns between
> > "fix" and "revert" cases, and don't see any reason why there are more
> > RTOs in the "fix" cases than the "revert" cases. All the RTOs seem to
> > be due to pre-existing (longstanding) performance problems in non-SACK
> > loss recovery.
> >
> > One way to proceed would be for me to offer some performance fixes for
> > the RTOs, so we can get rid of the RTOs, which are the biggest source
> > of performance variation. That should greatly reduce noise, and
> > perhaps make it easier to see if there is any real difference between
> > "fix" and "revert" cases.
> >
> > We could compare the following two kernels, with another 50 tests for
> > each:
> >
> > + (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
> > + (b) 6.6.93 + {2 patches to fix RTOs} + "fix"
> >
> > where:
> >
> > "revert" = revert e37ab7373696 ("tcp: fix to allow timestamp undo if
> > no retransmits were sent")
> > "fix" = apply d0fa59897e04 ("tcp: fix tcp_packet_delayed() for
> > tcp_is_non_sack_preventing_reopen() behavior"
> >
> > This would have the side benefit of testing some performance
> > improvements for non-SACK connections.
> >
> > Are you up for that? :-)
>
>
> Sure, if you have some patch ideas in mind, I'm all for getting patches
> merged to improve performance.

Great! Thanks for being willing to do this! I will try to post some
patches ASAP.

> BTW, what causes a non-SACK connection? The RX side is a near-idle Linux
> 6.8 host with default sysctl settings.

Given that the RX side is a Linux 6.8 host, the kernel should support
TCP SACK due to kernel compile-time defaults (see
"net->ipv4.sysctl_tcp_sack = 1;" in net/ipv4/tcp_ipv4.c).
Given that factor, off-hand, I can think of only a few reasons why the
RX side would not negotiate SACK support:
(1) Some script or software on the RX machine has disabled SACK via
"sysctl net.ipv4.tcp_sack=0" or equivalent, perhaps at boot time (this
is easy to check with "sysctl net.ipv4.tcp_sack").
(2) There is a middlebox on the path (doing firewalling or NAT, etc.)
that disables SACK.
(3) There is a firewall rule on some machine or router/switch that
disables SACK.
Off-hand, I would think that (2) is the most likely case, since
intentionally disabling SACK via sysctl or firewall rule is
inadvisable and rare.
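
If it would help narrow that down, one check from the TX side is
whether SACK was actually negotiated on a live connection to the RX
host: if the RX sysctl reads 1 but SACK still isn't negotiated, that
points at (2) or (3). An untested, Linux-only Python sketch (the
receiver address and port are placeholders):

import socket

RX_HOST = "192.0.2.1"  # placeholder receiver address
TCPI_OPT_SACK = 0x2    # from include/uapi/linux/tcp.h

# Local (TX-side) sysctl, to rule out a sender-side override.
with open("/proc/sys/net/ipv4/tcp_sack") as f:
    print("local net.ipv4.tcp_sack =", f.read().strip())

s = socket.create_connection((RX_HOST, 5201))  # e.g. an iperf3 server
info = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 192)
negotiated = bool(info[5] & TCPI_OPT_SACK)  # tcpi_options is byte 5
print("SACK negotiated on this connection:", negotiated)
s.close()
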
Any thoughts on which of these might be in play here?

Thanks,
neal