Message-ID: <CADVnQyna9cMvJf9Mp5jLR1vryAY1rEbAjZC_ef=Q8HRM4tNFzQ@mail.gmail.com>
Date: Sat, 7 Jun 2025 19:26:26 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Eric Wheeler <netdev@...ts.ewheeler.net>
Cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>,
Geumhwan Yu <geumhwan.yu@...sung.com>, Jakub Kicinski <kuba@...nel.org>,
Sasha Levin <sashal@...nel.org>, Yuchung Cheng <ycheng@...gle.com>, stable@...nel.org
Subject: Re: [BISECT] regression: tcp: fix to allow timestamp undo if no
retransmits were sent
On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> >
> > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > >
> > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@...ts.ewheeler.net> wrote:
> > > > >
> > > > > Hello Neal,
> > > > >
> > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > >
> > > > > Interestingly, the problem only presents itself when transmitting
> > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE -> device
> > > > > ~1Gbit: device =TX=> 1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > >
> > > > > Through bisection, we found this first-bad commit:
> > > > >
> > > > > tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > >
> > > >
> > > > Thank you for your detailed report and your offer to run some more tests!
> > > >
> > > > I don't have any good theories yet. It is striking that the apparent
> > > > retransmit rate is more than 100x higher in your "Before Revert" case
> > > > than in your "After Revert" case. It seems like something very odd is
> > > > going on. :-)
> > >
> > > good point, I wonder what that might imply...
> > >
> > > > If you could re-run tests while gathering more information, and share
> > > > that information, that would be very useful.
> > > >
> > > > What would be very useful would be the following information, for both
> > > > (a) Before Revert, and (b) After Revert kernels:
> > > >
> > > > # as root, before the test starts, start instrumentation
> > > > # and leave it running in the background; something like:
> > > > (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
> > > > nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
> > > > tcpdump -i ${eth} -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000 &
> > > >
> > > > # then run the test
> > > >
> > > > # then kill the instrumentation loops running in the background:
> > > > kill %1 %2 %3
> > >
> > > Sure, here they are:
> > >
> > > https://www.linuxglobal.com/out/for-neal/
> >
> > Hi Eric,
> >
> > Many thanks for the traces! They clearly show the buggy behavior. The
> > problem is an interaction between the non-SACK behavior on these
> > connections (the non-Linux "device" does not support SACK) and the
> > undo logic. For non-SACK connections,
> > tcp_is_non_sack_preventing_reopen() holds the connection in
> > CA_Recovery or CA_Loss at the end of a loss recovery episode but
> > clears tp->retrans_stamp to 0. So upon the next ACK, the code from
> > "tcp: fix to allow timestamp undo if no retransmits were sent" sees
> > tp->retrans_stamp at 0, erroneously concludes that no data was
> > retransmitted, and erroneously undoes the cwnd reduction, restoring
> > cwnd immediately to the value it had before loss recovery. That
> > causes an immediate build-up of queues and another immediate loss
> > recovery episode, hence the higher retransmit rate in the buggy
> > scenario.
> >
> > I will work on a packetdrill reproducer, test a fix, and post a patch
> > for testing. I think the simplest fix would be to have
> > tcp_packet_delayed(), when tp->retrans_stamp is zero, check for the
> > (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) condition and not
> > allow tcp_packet_delayed() to return true in that case. That should be
> > a precise fix for this scenario and should not risk changing behavior
> > for the much more common case of loss recovery with SACK support.
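> >
> > To make the failure mode and the proposed check concrete, here's a
> > rough standalone model (just a sketch, not actual kernel code; the
> > struct and helper names below are simplified stand-ins for the
> > corresponding struct tcp_sock fields and tcp_input.c helpers):
> >
> > #include <stdbool.h>
> > #include <stdio.h>
> >
> > struct flow {				/* stand-in for the relevant tcp_sock fields */
> > 	unsigned int retrans_stamp;	/* time of first retransmit; 0 = "none" */
> > 	unsigned int snd_una;		/* oldest unacknowledged sequence number */
> > 	unsigned int high_seq;		/* snd_nxt when loss recovery started */
> > 	bool sack_ok;			/* peer advertised SACK support */
> > };
> >
> > static bool is_reno(const struct flow *f)
> > {
> > 	return !f->sack_ok;
> > }
> >
> > /* Models tcp_is_non_sack_preventing_reopen(): hold the non-SACK flow
> >  * at high_seq at the end of recovery, and clear retrans_stamp.
> >  */
> > static void hold_at_high_seq(struct flow *f)
> > {
> > 	if (is_reno(f) && f->snd_una == f->high_seq)
> > 		f->retrans_stamp = 0;
> > }
> >
> > /* Models the retrans_stamp == 0 branch of the undo check in
> >  * tcp_packet_delayed(). Today, 0 is read as "nothing was
> >  * retransmitted", so undo is allowed. The proposed fix additionally
> >  * requires that we are NOT a non-SACK flow held at high_seq, since in
> >  * that case 0 only means the stamp was cleared by the
> >  * hold-at-high_seq path above.
> >  */
> > static bool packet_delayed_when_stamp_zero(const struct flow *f, bool fixed)
> > {
> > 	if (fixed && is_reno(f) && f->snd_una == f->high_seq)
> > 		return false;
> > 	return true;
> > }
> >
> > int main(void)
> > {
> > 	struct flow f = { .retrans_stamp = 1234, .snd_una = 3000,
> > 			  .high_seq = 3000, .sack_ok = false };
> >
> > 	hold_at_high_seq(&f);	/* end of loss recovery on the non-SACK flow */
> >
> > 	printf("spurious undo, current code: %d\n",
> > 	       packet_delayed_when_stamp_zero(&f, false));	/* prints 1 */
> > 	printf("spurious undo, proposed fix: %d\n",
> > 	       packet_delayed_when_stamp_zero(&f, true));	/* prints 0 */
> > 	return 0;
> > }
> >
> > The real change would of course go in tcp_packet_delayed() in
> > net/ipv4/tcp_input.c; the model above is only meant to show why the
> > tp->retrans_stamp == 0 test by itself is ambiguous on non-SACK flows.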
>
> Indeed, I'm able to reproduce this issue with erroneous undo events on
> non-SACK connections at the end of loss recovery with the attached
> packetdrill script.
>
> When you run that script on a kernel with the "tcp: fix to allow
> timestamp undo if no retransmits were sent" patch, we see:
>
> + nstat shows an erroneous TcpExtTCPFullUndo event
>
> + the loss recovery reduces cwnd from the initial 10 to the correct 7
> (from CUBIC) but then the erroneous undo event restores the pre-loss
> cwnd of 10 and leads to a final cwnd value of 11
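>
> (As a sanity check on those numbers: Linux CUBIC's multiplicative
> decrease factor is 717/1024 ≈ 0.7, so recovery takes cwnd from 10 to
> 10 * 717 / 1024 = 7, matching the value above; the final cwnd of 11
> presumably reflects one further increase after the spurious undo
> restores 10.)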
>
> I will test a patch with the proposed fix and report back.
Oops, forgot to attach the packetdrill script! Let's try again...
neal
Download attachment "fr-non-sack-hold-at-high-seq.pkt" of type "application/octet-stream" (2362 bytes)