Message-ID: <CADVnQy=SLM6vyWr5-UGg6TFU+b0g4s=A0h2ujRpphTyuxDYXKA@mail.gmail.com>
Date: Sat, 7 Jun 2025 15:13:11 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Eric Wheeler <netdev@...ts.ewheeler.net>
Cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>, 
	Geumhwan Yu <geumhwan.yu@...sung.com>, Jakub Kicinski <kuba@...nel.org>, 
	Sasha Levin <sashal@...nel.org>, Yuchung Cheng <ycheng@...gle.com>, stable@...nel.org
Subject: Re: [BISECT] regression: tcp: fix to allow timestamp undo if no
 retransmits were sent

On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
>
> On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > >
> > > Hello Neal,
> > >
> > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > across the switch on 10Gbit ports runs at full 10GbE.
> > >
> > > Interestingly, the problem only presents itself when transmitting
> > > from Linux; receive traffic (to Linux) performs just fine:
> > >         ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE  -> device
> > >          ~1Gbit: device        =TX=>  1GbE -> switch -> 10GbE -> Linux v6.6.85
> > >
> > > Through bisection, we found this first-bad commit:
> > >
> > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > >
> >
> > Thank you for your detailed report and your offer to run some more tests!
> >
> > I don't have any good theories yet. It is striking that the apparent
> > retransmit rate is more than 100x higher in your "Before Revert" case
> > than in your "After Revert" case. It seems like something very odd is
> > going on. :-)
>
> good point, I wonder what that might imply...
>
> > If you could re-run tests while gathering more information, and share
> > that information, that would be very useful.
> >
> > What would be very useful would be the following information, for both
> > (a) Before Revert, and (b) After Revert kernels:
> >
> > # as root, before the test starts, start instrumentation
> > # and leave it running in the background; something like:
> > (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
> > nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
> > tcpdump -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000  &
> >
> > # then run the test
> >
> > # then kill the instrumentation loops running in the background:
> > kill %1 %2 %3
>
> Sure, here they are:
>
>         https://www.linuxglobal.com/out/for-neal/

Hi Eric,

Many thanks for the traces! They clearly show the buggy behavior. The
problem is an interaction between the non-SACK behavior on these
connections (due to the non-Linux "device" not supporting SACK) and
the undo logic. For non-SACK connections,
tcp_is_non_sack_preventing_reopen() holds the connection in
CA_Recovery or CA_Loss at the end of a loss recovery episode but
clears tp->retrans_stamp to 0. As a result, upon the next ACK the
code from "tcp: fix to allow timestamp undo if no retransmits were
sent" sees tp->retrans_stamp at 0, erroneously concludes that no data
was retransmitted, and erroneously performs an undo of the cwnd
reduction, restoring cwnd immediately to the value it had before loss
recovery. This causes an immediate build-up of queues and another
immediate loss recovery episode, hence the much higher retransmit
rate in the buggy scenario.
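
To make the sequence concrete, here is a rough sketch of the
interaction (condensed pseudo-code based on my description above, not
a literal excerpt from tcp_input.c; the field and function names are
the real ones, but the control flow is simplified):

  /* End of a loss recovery episode on a non-SACK (Reno) connection: */
  if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
          /* stay in CA_Recovery/CA_Loss until data above high_seq
           * is ACKed, but ...
           */
          tp->retrans_stamp = 0;          /* stamp cleared here */
  }

  /* On the next ACK, the timestamp-undo check added by the commit in
   * question sees tp->retrans_stamp == 0, treats that as "no
   * retransmits were sent", and undoes the cwnd reduction.
   */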

I will work on a packetdrill reproducer, test a fix, and post a patch
for testing. I think the simplest fix would be to have
tcp_packet_delayed(), when tp->retrans_stamp is zero, check for the
(tp->snd_una == tp->high_seq && tcp_is_reno(tp)) condition and not
return true in that case. That should be a precise fix for this
scenario and would not risk changing behavior for the much more
common case of loss recovery with SACK support.
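
In other words, something roughly along these lines (untested sketch
of the idea only, not the actual patch; the exact placement inside
tcp_packet_delayed() and the final form may differ):

  /* Sketch: when retrans_stamp is zero, don't report "delayed" (and
   * thus don't allow undo) if we are in the non-SACK hold-at-high_seq
   * state, where tcp_is_non_sack_preventing_reopen() may have cleared
   * the stamp rather than no retransmits having been sent:
   */
  if (!tp->retrans_stamp &&
      tp->snd_una == tp->high_seq && tcp_is_reno(tp))
          return false;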

Eric, would you be willing to test a simple bug fix patch for this?

Thanks!

neal
