Message-ID: <CADVnQynf8f7SUtZ8iQi-fACYLpAyLqDKQVYKN-mkEgVtFUTVXQ@mail.gmail.com>
Date: Wed, 30 Mar 2022 09:56:45 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Jaco Kroon <jaco@....co.za>
Cc: Eric Dumazet <edumazet@...gle.com>,
LKML <linux-kernel@...r.kernel.org>,
Netdev <netdev@...r.kernel.org>,
Yuchung Cheng <ycheng@...gle.com>
Subject: Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections
On Wed, Mar 30, 2022 at 2:22 AM Jaco Kroon <jaco@....co.za> wrote:
>
> Hi Eric,
>
> On 2022/03/30 05:48, Eric Dumazet wrote:
> > On Tue, Mar 29, 2022 at 7:58 PM Jaco Kroon <jaco@....co.za> wrote:
> >
> > I do not think this commit is related to the issue you have.
> >
> > I guess you could try a revert ?
> >
> > Then, if you think old linux versions were ok, start a bisection ?
> That'll be interesting, will see if I can reproduce on a non-production
> host.
> >
> > Thank you.
> >
> > (I do not see why a successful TFO would lead to a freeze after ~70 KB
> > of data has been sent)
>
> I do actually agree with this in that it makes no sense, but disabling
> TFO definitely resolved the issue for us.
>
> Kind Regards,
> Jaco
Thanks for the pcap trace! That's a pretty strange trace. I agree with
Eric's theory that this looks like one or more bugs in a firewall,
middlebox, or netfilter rule. From the trace it looks like the buggy
component is sometimes dropping packets and sometimes corrupting them
so that the client's TCP stack ignores them.

Interestingly, in that trace the client SYN has a TFO option and
cookie, but no data in the SYN.

The last packet that looks sane/normal is the ACK from the SMTP server
that looks like:
  00:00:00.000010 IP6 2a00:1450:4013:c16::1a.25 >
  2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . 6260:6260(0) ack 66263 win
  774 <nop,nop,TS val 1206544341 ecr 331189186>
That's the first ACK that crosses past 2^16. Maybe that is a
coincidence, or maybe not. Perhaps the buggy firewall/middlebox/etc is
confused by the TFO option, corrupts its state, and thereafter behaves
incorrectly past the first 64 KBytes of data from the client.
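
As a quick sanity check of the 2^16 observation, the sketch below
computes how far an ACK number sits past the 64 KByte mark. It assumes
the ack value in the tcpdump line is a relative sequence number
(tcpdump's default once it has seen the SYN); the helper name
bytes_past_boundary is made up for illustration.

```python
def bytes_past_boundary(ack, bits=16):
    """Return how many bytes `ack` lies beyond 2**bits (negative if below)."""
    return ack - (1 << bits)


# ACK value taken from the trace line above, assumed to be relative.
last_sane_ack = 66263
print(bytes_past_boundary(last_sane_ack))  # 727
```

So the last ACK that gets through is only a few hundred bytes past
65536, which fits the idea that something breaks right around that
boundary, though it does not prove it.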
In addition to checking for checksum failures, mentioned by Eric, you
could look for PAWS failures, something like:
  nstat -az | egrep -i 'TcpInCsumError|PAWS'
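
If nstat is not installed on the host, roughly the same counters can
probably be read straight from /proc. This is only a minimal sketch,
assuming the usual paired header/value line layout of /proc/net/snmp
and /proc/net/netstat; the function names are hypothetical, and the
exact set of counters varies by kernel version.

```python
import re


def parse_proc_net_counters(text):
    """Parse /proc/net/snmp or /proc/net/netstat style text into {name: value}."""
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: "Proto: Name1 Name2 ..." then "Proto: v1 v2 ..."
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, vals = values.split(":", 1)
        if proto != value_proto:
            continue  # not a matching header/value pair, skip it
        for name, val in zip(names.split(), vals.split()):
            # Prefix with the protocol so names resemble nstat's,
            # e.g. "Tcp" + "InCsumErrors" -> "TcpInCsumErrors".
            counters[proto + name] = int(val)
    return counters


def matching_counters(counters, pattern=r"InCsumError|PAWS"):
    """Keep counters whose name matches `pattern`, including zero values."""
    rx = re.compile(pattern, re.IGNORECASE)
    return {name: val for name, val in counters.items() if rx.search(name)}


if __name__ == "__main__":
    merged = {}
    for path in ("/proc/net/snmp", "/proc/net/netstat"):
        try:
            with open(path) as f:
                merged.update(parse_proc_net_counters(f.read()))
        except OSError:
            pass  # file not available on this system
    for name, val in sorted(matching_counters(merged).items()):
        print(name, val)
```

Counters that keep increasing while the connection is stalled would be
the interesting ones.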
best,
neal