netdev - Re: Debugging stuck tcp connection across localhost

Message-ID: <CADVnQyn97m5ybVZ3FdWAw85gOMLAvPSHiR8_NC_nGFyBdRySqQ@mail.gmail.com>
Date:   Thu, 6 Jan 2022 10:20:32 -0500
From:   Neal Cardwell <ncardwell@...gle.com>
To:     Ben Greear <greearb@...delatech.com>
Cc:     netdev <netdev@...r.kernel.org>
Subject: Re: Debugging stuck tcp connection across localhost

On Thu, Jan 6, 2022 at 10:06 AM Ben Greear <greearb@...delatech.com> wrote:
>
> Hello,
>
> I'm working on a strange problem, and could use some help if anyone has ideas.
>
> On a heavily loaded system (500+ wifi station devices, VRF device per 'real' netdev,
> traffic generation on the netdevs, etc), I see cases where two processes trying
> to communicate across localhost with TCP seem to get a stuck network
> connection:
>
> [greearb@...dt7 ben_debug]$ grep 4004 netstat.txt |grep 127.0.0.1
> tcp        0 7988926 127.0.0.1:4004          127.0.0.1:23184         ESTABLISHED
> tcp        0  59805 127.0.0.1:23184         127.0.0.1:4004          ESTABLISHED
>
> Both processes in question continue to execute, and as far as I can tell, they are properly
> attempting to read/write the socket, but they are reading/writing 0 bytes (these sockets
> are non blocking).  If one was stuck not reading, I would expect netstat
> to show bytes in the rcv buffer, but it is zero as you can see above.
>
> Kernel is 5.15.7+ local hacks.  I can only reproduce this in a big messy complicated
> test case, with my local ath10k-ct and other patches that enable virtual wifi stations,
> but my code can grab logs at time it sees the problem.  Is there anything
> more I can do to figure out why the TCP connection appears to be stuck?

It could be very useful to get more information about the state of all
the stuck connections (sender and receiver side) with something like:

  ss -tinmo 'sport = :4004 or dport = :4004'

I would recommend downloading and building a recent version of the
'ss' tool to maximize the information. Here is a recipe for doing
that:

 https://github.com/google/bbr/blob/master/Documentation/bbr-faq.md#how-can-i-monitor-linux-tcp-bbr-connections

It could also be very useful to collect and share packet traces, as
long as taking traces does not consume an infeasible amount of space,
or perturb timing in a way that makes the buggy behavior disappear.
For example, as root:

  tcpdump -w /tmp/trace.pcap -s 120 -c 100000000 -i any port 4004 &

If space is an issue, you might start taking traces once things get
stuck to see what the retry behavior, if any, looks like.
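One way to defer the trace until things look stuck is to watch the netstat queues first. The following is only a rough sketch, not something from the original report: stuck_candidates is a hypothetical helper name, and it assumes `netstat -tn`-style columns (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). It prints ESTABLISHED connections on the given port whose Send-Q is non-zero while Recv-Q is zero, which is the pattern shown in the netstat output above. A non-zero Send-Q in a single snapshot is normal under load, so the same connection should show up in several consecutive checks before it is treated as stuck.

```shell
# Hypothetical helper (not part of any tool): read `netstat -tn`-style lines
# on stdin and print "local remote send-q" for ESTABLISHED TCP connections
# that touch the given port and have Recv-Q == 0 and Send-Q > 0.
stuck_candidates() {
  port="$1"
  awk -v port="$port" '
    $1 ~ /^tcp/ && $6 == "ESTABLISHED" && $2 == 0 && $3 > 0 {
      # The port is the last colon-separated piece of each address.
      n = split($4, l, ":")
      m = split($5, r, ":")
      if (l[n] == port || r[m] == port) print $4, $5, $3
    }'
}

# Example use (as root), assuming netstat and tcpdump are installed:
#   while [ -z "$(netstat -tn | stuck_candidates 4004)" ]; do sleep 5; done
#   tcpdump -w /tmp/trace.pcap -s 120 -i any port 4004 &
```

The commented loop starts the capture on the first match only to keep the sketch short; requiring several matching snapshots in a row would be a more reliable trigger.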

thanks,
neal