linux-kernel - Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

Message-ID: <CADVnQyn=A9EuTwxe-Bd9qgD24PLQ02YQy0_b7YWZj4_rqhWRVA@mail.gmail.com>
Date:   Tue, 29 Mar 2022 22:01:05 -0400
From:   Neal Cardwell <ncardwell@...gle.com>
To:     Jaco <jaco@....co.za>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Netdev <netdev@...r.kernel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>
Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

On Tue, Mar 29, 2022 at 9:03 PM Jaco <jaco@....co.za> wrote:
>
> Dear All,
>
> I'm seeing very strange TCP behaviour.  I disabled TCP Segmentation Offload to
> try to pinpoint this more closely.
>
> It seems the kernel is ignoring ACKs coming from the remote side in some cases.
> In this case, on one of four hosts, and seemingly between this one host and
> Google ... (We have two emails to Google stuck on another host due to the same
> issue, but several hundred others went out today on that same host).  I also
> killed selective ACKs as a test as these are known to sometimes cause issues
> for firewalls and "tcp accelerators" (or used to at the very least).
>
> SMTP connection between ourselves and Google ... I'm going to be selective in
> copying from tcpdump (full conversation up to the point where I killed it
> because it plainly got stuck in a loop is attached).
>
> Connection setup:
>
> 00:56:17.055481 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [S], seq 956633779, win 62580, options [mss 8940,nop,nop,TS val 3687705482 ecr 0,nop,wscale 7,tfo  cookie f025dd84b6122510,nop,nop], length 0
>
> 00:56:17.217747 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [S.], seq 726465675, ack 956633780, win 65535, options [mss 1440,nop,nop,TS val 3477429218 ecr 3687705482,nop,wscale 8], length 0
>
> 00:56:17.218628 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465676:726465760, ack 956633780, win 256, options [nop,nop,TS val 3477429220 ecr 3687705482], length 84: SMTP: 220 mx.google.com ESMTP e16-20020a05600c4e5000b0038c77be9b2dsi226281wmq.72 - gsmtp
>
> 00:56:17.218663 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], ack 726465760, win 489, options [nop,nop,TS val 3687705645 ecr 3477429220], length 0
>
> This is pretty normal, we advertise an MSS of 8940 and the return is 1440, thus
> we shouldn't send segments larger than that, and they "can't".  I need to
> determine if this is some form of offloading or they really are sending >1500
> byte frames (which I know won't pass our firewalls without fragmentation so
> probably some form of NIC offloading - which if it was active on older 5.8
> kernels did not cause problems):
>
> 00:56:17.709905 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465979:726468395, ack 956634111, win 261, options [nop,nop,TS val 3477429710 ecr 3687705973], length 2416: SMTP
>
> 00:56:17.709906 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726468395:726470811, ack 956634111, win 261, options [nop,nop,TS val 3477429710 ecr 3687705973], length 2416: SMTP
>
> These are the only two frames I can find that supposedly exceed the MSS values
> (although, they don't exceed our value).
>
> Then everything goes pretty normal for a bit.  The last data we receive from
> the remote side before stuff goes wrong:
>
> 00:56:18.088725 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726471823:726471919, ack 956634348, win 261, options [nop,nop,TS val 3477430089 ecr 3687706330], length 96: SMTP
>
> We ACK immediately along with the next segment:
>
> 00:56:18.088969 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956634348:956635776, ack 726471919, win 446, options [nop,nop,TS val 3687706515 ecr 3477430089], length 1428: SMTP
>
> Hereafter there is a flurry of data that we transmit, all nicely acknowledged,
> no retransmits that I can pick up (eyeballs).
>
> Before a long sequence of TX data we get this ACK:
>
> 00:56:18.576247 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956700036, win 774, options [nop,nop,TS val 3477430577 ecr 3687706840], length 0
>
> We then continue to TX a sequence of:
>
> 00:56:18.576300 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956745732:956747160, ack 726471919, win 446, options [nop,nop,TS val 3687707002 ecr 3477430577], length 1428: SMTP
>
> up to:
>
> 00:56:18.577031 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [P.], seq 956778576:956780004, ack 726471919, win 446, options [nop,nop,TS val 3687707003 ecr 3477430577], length 1428: SMTP
>
> Before we hit our first retransmit:
>
> 00:56:18.960078 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956700036:956701464, ack 726471919, win 446, options [nop,nop,TS val 3687707386 ecr 3477430577], length 1428: SMTP
>
> Since 956700036 is the last ACKed data, this seems correct.  I'm not sure what
> timer this is based on though; the ACK for the just prior data came in ~384ms
> earlier (it could be based on the normal time to ACK, I don't know; this is
> about double the usual round-trip time currently).
>
> And then we receive this ACK (we can see this time the kernel waited for ACK of
> this single segment):
>
> 00:56:19.126678 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956701464, win 785, options [nop,nop,TS val 3477431127 ecr 3687707386], length 0
>
> Then we do something (in my opinion) strange by jumping back to the tail of the previous burst:
>
> 00:56:19.126735 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956780004:956781432, ack 726471919, win 446, options [nop,nop,TS val 3687707553 ecr 3477431127], length 1428: SMTP
>
> 00:56:19.126751 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956781432:956782860, ack 726471919, win 446, options [nop,nop,TS val 3687707553 ecr 3477431127], length 1428: SMTP
>
> We then jump back and retransmit again from the just received ACK:
>
> 00:56:19.510078 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956701464:956702892, ack 726471919, win 446, options [nop,nop,TS val 3687707936 ecr 3477431127], length 1428: SMTP
>
> We then continue from there on as I'd expect (slow start); this goes pretty
> normally up to:
>
> 00:56:19.997088 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956708604, win 841, options [nop,nop,TS val 3477431998 ecr 3687708261], length 0
>
> 00:56:19.997148 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956708604:956710032, ack 726471919, win 446, options [nop,nop,TS val 3687708423 ecr 3477431998], length 1428: SMTP
>
> 00:56:20.262683 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477432263 ecr 3687708423], length 0
>
> Up to here is fine; now things get bizarre: we just jump to a different
> sequence number, which has already been ACKed:
>
> 00:56:20.380076 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq *956707176*:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687708806 ecr 3477431998], length 1428: SMTP
>
> 00:56:20.542356 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477432543 ecr 3687708423], length 0
>
> And remote side re-ACKs the 956710032 value, which frankly indicates we need to
> realize that the data we are transmitting has already been received, and we can
> continue on to transmit the segments following up on sequence number 956710032,
> instead we choose to get stuck in this sequence:
>
> 00:56:21.180080 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956707176:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687709606 ecr 3477431998], length 1428: SMTP
>
> 00:56:21.342347 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477433343 ecr 3687708423], length 0
>
> 00:56:22.780101 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956707176:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687711206 ecr 3477431998], length 1428: SMTP
>
> 00:56:22.942346 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], ack 956710032, win 852, options [nop,nop,TS val 3477434943 ecr 3687708423], length 0
>
> And here the connection dies.  It eventually times out, and we retry to the
> next host, resulting in the same problem.
>
> I am aware that Google is having congestion issues in the JHB area in SA
> currently, and there are probably packet delays and losses somewhere along the
> line between us, but this really should not stall as dead as it is here.
>
> Looking at only the incoming ACK values, I can see they never decrease, and
> we've never received an ACK > 956710032, but this is still greater than the
> sequence number we are retransmitting.
>
> The first time we transmitted the frame at sequence number 956707176 was part
> of the longest sequence of TX frames without a returning ACK, part of this
> sequence:
>
> ...
>
> 00:56:18.414299 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956705748:956707176, ack 726471919, win 446, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP
>
> 00:56:18.414302 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [P.], seq 956707176:956708604, ack 726471919, win 446, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP
>
> 00:56:18.414316 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], seq 956708604:956710032, ack 726471919, win 446, options [nop,nop,TS val 3687706840 ecr 3477430415], length 1428: SMTP
>
> ...
>
> Google here is ACKing not only the frame we are continuously retransmitting,
> but also the frame directly after ... so why would the kernel not move on to
> retransmitting starting from sequence number 956710032 (which is larger than
> the start sequence number of the frame we are retransmitting)?
>
> Kind Regards,
> Jaco
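
The check described in the report (an outgoing segment whose bytes are all below the highest cumulative ACK already received) can be sketched in a few lines of Python run over the tcpdump text output. This is only an illustration, not a tool used in this thread: the function name is hypothetical, the two sample lines are taken from the trace above with the options field trimmed, and the comparison assumes absolute sequence numbers with no 32-bit wraparound.

```python
import re

# Local endpoint as it appears in the quoted trace.
LOCAL = "2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110"

# Matches the address pair, then the optional "seq A:B," and "ack N," fields.
LINE_RE = re.compile(
    r"IP6 (?P<src>\S+) > (?P<dst>\S+): Flags \[[^\]]*\],"
    r"(?: seq (?P<seq>\d+)(?::(?P<end>\d+))?,)?"
    r"(?: ack (?P<ack>\d+),)?"
)

def find_stale_retransmits(lines, local=LOCAL):
    """Return (start, end, highest_ack) for each segment sent by `local`
    whose bytes had all been cumulatively ACKed by the peer already.
    Assumption: absolute sequence numbers, no wraparound handling."""
    highest_ack = None
    stale = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        if m["src"] == local:
            # Outgoing data segment: compare its end against the peer's ACK.
            if m["seq"] and m["end"] and highest_ack is not None:
                start, end = int(m["seq"]), int(m["end"])
                if end <= highest_ack:
                    stale.append((start, end, highest_ack))
        elif m["ack"]:
            # Incoming packet: remember the highest cumulative ACK seen.
            ack = int(m["ack"])
            highest_ack = ack if highest_ack is None else max(highest_ack, ack)
    return stale

# Two lines from the trace above (options trimmed for brevity).
SAMPLE = [
    "00:56:20.262683 IP6 2a00:1450:400c:c07::1b.25 > "
    "2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [.], "
    "ack 956710032, win 852, length 0",
    "00:56:21.180080 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > "
    "2a00:1450:400c:c07::1b.25: Flags [.], seq 956707176:956708604, "
    "ack 726471919, win 446, length 1428: SMTP",
]

for start, end, acked in find_stale_retransmits(SAMPLE):
    print(f"sent {start}:{end} although peer had ACKed up to {acked}")
```

Run against the full text capture, any line this prints would be a retransmission of data the peer had already acknowledged, which is the behaviour the report describes.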

Thanks for the report!  I have CC-ed the netdev list, since it is
probably a better forum for this discussion.

Can you please attach (or link to) a tcpdump raw .pcap file (produced
with the -w flag)? There are a number of tools that will make this
easier to visualize and analyze if we can see the raw .pcap file. You
may want to anonymize the trace and/or capture just headers, etc. (for
example, the -s flag can control how much of each packet tcpdump
grabs).
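
As a rough illustration of the -w and -s flags mentioned above, the sketch below builds (but does not run) a tcpdump command line that writes a raw capture and keeps only headers. The interface name, output file name, and helper function are assumptions made for the example; the snap length is simply the Ethernet, IPv6, and maximum TCP header sizes added together, assuming no VLAN tag or IPv6 extension headers.

```python
import shlex

ETH_HDR = 14      # Ethernet header, assuming no VLAN tag
IPV6_HDR = 40     # fixed IPv6 header, assuming no extension headers
TCP_HDR_MAX = 60  # 20-byte base header plus up to 40 bytes of options

def tcpdump_argv(iface, outfile, port=25):
    """Build a tcpdump argument list for a headers-only raw capture.
    `iface` and `outfile` are placeholders chosen by the caller."""
    snaplen = ETH_HDR + IPV6_HDR + TCP_HDR_MAX
    return [
        "tcpdump",
        "-n",                 # do not resolve addresses to names
        "-i", iface,          # capture interface
        "-s", str(snaplen),   # bytes to keep from each packet
        "-w", outfile,        # write raw packets to a .pcap file
        "tcp", "port", str(port),
    ]

print(shlex.join(tcpdump_argv("eth0", "smtp-stall.pcap")))
```

Limiting the snap length keeps the SMTP payload out of the file while preserving the sequence numbers, ACKs, and TCP options needed for analysis.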

Can you please share the exact kernel version of the client machine?

Also, can you please summarize/clarify whether you think the client,
server, or both are misbehaving?

Thanks!
neal
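
On the MSS question in the quoted report: the 1428-byte segments sent by the client are consistent with the server's advertised MSS of 1440 minus the 12 bytes of TCP options carried on each segment (the timestamps option plus two NOPs, visible as "nop,nop,TS" in the trace). A minimal worked check follows, with constant and function names chosen only for illustration. The observation that 2416 is exactly twice 1208 is arithmetic only; whether those frames were coalesced by receive offload is an assumption the text capture cannot confirm.

```python
PEER_MSS = 1440        # MSS option in the server's SYN-ACK
TIMESTAMP_OPTION = 10  # kind + length + TSval + TSecr
NOP_PADDING = 2        # the two NOPs preceding the timestamps option

def max_payload(mss, option_bytes):
    """Largest data payload per segment once TCP options are accounted for."""
    return mss - option_bytes

payload = max_payload(PEER_MSS, TIMESTAMP_OPTION + NOP_PADDING)
print(payload)  # 1428, the length of the client's data segments in the trace

# The two incoming frames reported with length 2416:
print(2416 > PEER_MSS)    # True: larger than a single segment could be
print(2416 == 2 * 1208)   # True: would fit two equal-sized segments merged
```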

View attachment "iewc_google.txt" of type "text/plain" (38985 bytes)