Message-ID: <CADVnQyn97m5ybVZ3FdWAw85gOMLAvPSHiR8_NC_nGFyBdRySqQ@mail.gmail.com>
Date: Thu, 6 Jan 2022 10:20:32 -0500
From: Neal Cardwell <ncardwell@...gle.com>
To: Ben Greear <greearb@...delatech.com>
Cc: netdev <netdev@...r.kernel.org>
Subject: Re: Debugging stuck tcp connection across localhost
On Thu, Jan 6, 2022 at 10:06 AM Ben Greear <greearb@...delatech.com> wrote:
>
> Hello,
>
> I'm working on a strange problem, and could use some help if anyone has ideas.
>
> On a heavily loaded system (500+ wifi station devices, VRF device per 'real' netdev,
> traffic generation on the netdevs, etc), I see cases where two processes trying
> to communicate across localhost with TCP seem to get a stuck network
> connection:
>
> [greearb@...dt7 ben_debug]$ grep 4004 netstat.txt |grep 127.0.0.1
> tcp 0 7988926 127.0.0.1:4004 127.0.0.1:23184 ESTABLISHED
> tcp 0 59805 127.0.0.1:23184 127.0.0.1:4004 ESTABLISHED
>
> Both processes in question continue to execute, and as far as I can tell, they are properly
> attempting to read/write the socket, but they are reading/writing 0 bytes (these sockets
> are non blocking). If one was stuck not reading, I would expect netstat
> to show bytes in the rcv buffer, but it is zero as you can see above.
>
> Kernel is 5.15.7+ local hacks. I can only reproduce this in a big messy complicated
> test case, with my local ath10k-ct and other patches that enable virtual wifi stations,
> but my code can grab logs at the time it sees the problem. Is there anything
> more I can do to figure out why the TCP connection appears to be stuck?
It could be very useful to get more information about the state of all
the stuck connections (sender and receiver side) with something like:
ss -tinmo 'sport = :4004 or dport = :4004'
I would recommend downloading and building a recent version of the
'ss' tool to maximize the information. Here is a recipe for doing
that:
https://github.com/google/bbr/blob/master/Documentation/bbr-faq.md#how-can-i-monitor-linux-tcp-bbr-connections
It could also be very useful to collect and share packet traces, as
long as taking traces does not consume an infeasible amount of space,
or perturb timing in a way that makes the buggy behavior disappear.
For example, as root:
tcpdump -w /tmp/trace.pcap -s 120 -c 100000000 -i any port 4004 &
If space is an issue, you might start taking traces once things get
stuck to see what the retry behavior, if any, looks like.
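To decide when things are "stuck" without watching by hand, one option
is to sample the socket queues from /proc/net/tcp. The helper below is
only a rough sketch: the function name tcp_queues_for_port is made up
for illustration, and it assumes the usual /proc/net/tcp column layout
(hex ports after the colon in the address fields, hex tx_queue:rx_queue
in the fifth column). It prints the local port, remote port, send queue
and receive queue in decimal for every socket using the given port:

```shell
# Hypothetical helper, not part of any existing tool.
# Reads /proc/net/tcp-format text on stdin. For each socket whose local
# or remote port equals $1, prints:
#   local_port remote_port tx_queue_bytes rx_queue_bytes   (all decimal)
tcp_queues_for_port() (
    want=$1
    while read -r sl laddr raddr st queues rest; do
        # Data rows have "ADDR:PORT" in column 2; the header row does not.
        case $laddr in
            *:*) ;;
            *) continue ;;
        esac
        # Ports and queue sizes are hex; 0x-prefixed constants are
        # understood by POSIX shell arithmetic.
        lport=$((0x${laddr##*:}))
        rport=$((0x${raddr##*:}))
        txq=$((0x${queues%%:*}))
        rxq=$((0x${queues##*:}))
        if [ "$lport" -eq "$want" ] || [ "$rport" -eq "$want" ]; then
            echo "$lport $rport $txq $rxq"
        fi
    done
)
```

It could be run as "tcp_queues_for_port 4004 < /proc/net/tcp" every few
seconds; a send queue that stays large and unchanged across samples
would be a reasonable cue to start the tcpdump above.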
thanks,
neal