Message-ID: <CADVnQy=kB-B-9rAOgSjBAh+KHx4pkz-VoTnBZ0ye+Fp4hjicPA@mail.gmail.com>
Date: Sat, 7 Jun 2025 18:54:27 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Eric Wheeler <netdev@...ts.ewheeler.net>
Cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>, 
	Geumhwan Yu <geumhwan.yu@...sung.com>, Jakub Kicinski <kuba@...nel.org>, 
	Sasha Levin <sashal@...nel.org>, Yuchung Cheng <ycheng@...gle.com>, stable@...nel.org
Subject: Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent

On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> >
> > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > >
> > > > Hello Neal,
> > > >
> > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > across the switch on 10Gbit ports ran at full 10GbE.
> > > >
> > > > Interestingly, the problem only presents itself when transmitting
> > > > from Linux; receive traffic (to Linux) performs just fine:
> > > >         ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE  -> device
> > > >          ~1Gbit: device        =TX=>  1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > >
> > > > Through bisection, we found this first-bad commit:
> > > >
> > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > >
> > >
> > > Thank you for your detailed report and your offer to run some more tests!
> > >
> > > I don't have any good theories yet. It is striking that the apparent
> > > retransmit rate is more than 100x higher in your "Before Revert" case
> > > than in your "After Revert" case. It seems like something very odd is
> > > going on. :-)
> >
> > good point, I wonder what that might imply...
> >
> > > If you could re-run the tests while gathering more information, and
> > > share that information, that would be very helpful.
> > >
> > > Specifically, the following would be useful for both the (a) Before
> > > Revert and (b) After Revert kernels:
> > >
> > > # as root, before the test starts, start instrumentation
> > > # and leave it running in the background; something like:
> > > (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
> > > nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
> > > tcpdump -i ${eth} -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000 &
> > >
> > > # then run the test
> > >
> > > # then kill the instrumentation loops running in the background:
> > > kill %1 %2 %3
> >
> > Sure, here they are:
> >
> >         https://www.linuxglobal.com/out/for-neal/
>
> Hi Eric,
>
> Many thanks for the traces! They clearly show the buggy behavior. The
> problem is an interaction between the non-SACK behavior on these
> connections (the non-Linux "device" does not support SACK) and the
> undo logic. For non-SACK connections,
> tcp_is_non_sack_preventing_reopen() holds the connection in
> CA_Recovery or CA_Loss at the end of a loss recovery episode but
> clears tp->retrans_stamp to 0. As a result, upon the next ACK, the
> code from "tcp: fix to allow timestamp undo if no retransmits were
> sent" sees tp->retrans_stamp at 0, erroneously concludes that no data
> was retransmitted, and erroneously performs an undo of the cwnd
> reduction, restoring cwnd immediately to its pre-recovery value. This
> causes an immediate build-up of queues and another immediate loss
> recovery episode, hence the much higher retransmit rate in the buggy
> scenario.
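>
> To make that concrete, here is a toy model of the misread (plain
> userspace C, not the kernel code; the field names mirror the tcp_sock
> fields discussed in this thread, but the struct and logic are
> simplified stand-ins):
>
> #include <stdbool.h>
> #include <stdint.h>
> #include <stdio.h>
>
> /* Simplified stand-ins for the kernel fields discussed above. */
> struct toy_tcp_sock {
>         uint32_t retrans_stamp; /* TS of first retransmit; 0 = "none" */
>         uint32_t snd_una;       /* first unacknowledged sequence number */
>         uint32_t high_seq;      /* snd_nxt when loss recovery began */
>         bool     is_reno;       /* peer did not negotiate SACK */
> };
>
> /* The buggy reading: a zero retrans_stamp is taken to mean "no data
>  * was retransmitted", so undoing the cwnd reduction is allowed.
>  * (The timestamp-echo comparison for nonzero stamps is elided.) */
> static bool packet_delayed_buggy(const struct toy_tcp_sock *tp)
> {
>         return tp->retrans_stamp == 0;
> }
>
> int main(void)
> {
>         /* End of non-SACK loss recovery: data *was* retransmitted,
>          * but the reopen logic cleared the stamp while holding
>          * snd_una at high_seq. */
>         struct toy_tcp_sock tp = { .retrans_stamp = 0, .snd_una = 1000,
>                                    .high_seq = 1000, .is_reno = true };
>
>         /* Prints 1: undo is (wrongly) allowed despite retransmits. */
>         printf("undo allowed: %d\n", packet_delayed_buggy(&tp));
>         return 0;
> }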
>
> I will work on a packetdrill reproducer, test a fix, and post a patch
> for testing. I think the simplest fix would be to have
> tcp_packet_delayed(), when tp->retrans_stamp is zero, check for the
> (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) condition and, in
> that case, not allow tcp_packet_delayed() to return true. That should
> be a precise fix for this scenario and would not risk changing
> behavior for the much more common case of loss recovery with SACK
> support.
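>
> A minimal sketch of that guard, reusing the toy struct from the
> sketch above (again, this is the shape of the idea, not the actual
> patch against tcp_input.c):
>
> /* Proposed shape: a zero retrans_stamp is ambiguous when a non-SACK
>  * (Reno) connection is being held at high_seq, because the reopen
>  * logic clears the stamp there; refuse to call the packets "delayed"
>  * (and hence refuse to undo) in that state. */
> static bool packet_delayed_proposed(const struct toy_tcp_sock *tp)
> {
>         if (tp->retrans_stamp == 0) {
>                 if (tp->snd_una == tp->high_seq && tp->is_reno)
>                         return false; /* stamp cleared by reopen logic */
>                 return true;          /* genuinely no retransmits sent */
>         }
>         return false; /* timestamp-echo comparison elided, as above */
> }
>
> Swapping packet_delayed_proposed() into the toy main() above prints 0:
> the spurious undo is suppressed, while a connection that truly sent no
> retransmits (snd_una != high_seq, or with SACK enabled) still returns
> true.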

Indeed, with the attached packetdrill script I'm able to reproduce this
issue of erroneous undo events on non-SACK connections at the end of
loss recovery.

Running that script on a kernel with the "tcp: fix to allow timestamp
undo if no retransmits were sent" patch shows:

+ nstat shows an erroneous TcpExtTCPFullUndo event

+ the loss recovery reduces cwnd from the initial 10 to the correct 7
(from CUBIC's multiplicative decrease), but then the erroneous undo
event restores the pre-loss cwnd of 10 and leads to a final cwnd value
of 11 (see the quick arithmetic check below)
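
For reference on those cwnd numbers: Linux CUBIC applies a
multiplicative decrease of 717/1024 (~0.7), so the correct
post-recovery cwnd starting from 10 is 10 * 717 / 1024 = 7. A quick
arithmetic check (toy code, not the kernel's CUBIC implementation):

#include <stdio.h>

int main(void)
{
        unsigned int cwnd = 10;  /* initial cwnd */
        unsigned int beta = 717; /* CUBIC beta, scaled by 1024 */
        unsigned int ssthresh = cwnd * beta / 1024; /* 7170/1024 -> 7 */

        printf("correct post-recovery cwnd: %u\n", ssthresh); /* 7 */
        /* The bogus undo restores 10; the trace then shows one further
         * increase, for the observed final value of 11. */
        printf("observed final cwnd:        %u\n", cwnd + 1);  /* 11 */
        return 0;
}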

I will test a patch with the proposed fix and report back.

neal
