Date:   Mon, 10 Jan 2022 10:10:00 -0800
From:   Ben Greear <greearb@...delatech.com>
To:     Neal Cardwell <ncardwell@...gle.com>
Cc:     netdev <netdev@...r.kernel.org>
Subject: Re: Debugging stuck tcp connection across localhost

On 1/6/22 2:26 PM, Ben Greear wrote:
> On 1/6/22 12:04 PM, Neal Cardwell wrote:
>> On Thu, Jan 6, 2022 at 2:05 PM Ben Greear <greearb@...delatech.com> wrote:
>>>
>>> On 1/6/22 8:16 AM, Neal Cardwell wrote:
>>>> On Thu, Jan 6, 2022 at 10:39 AM Ben Greear <greearb@...delatech.com> wrote:
>>>>>
>>>>> On 1/6/22 7:20 AM, Neal Cardwell wrote:
>>>>>> On Thu, Jan 6, 2022 at 10:06 AM Ben Greear <greearb@...delatech.com> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm working on a strange problem, and could use some help if anyone has ideas.
>>>>>>>
>>>>>>> On a heavily loaded system (500+ wifi station devices, a VRF device per 'real' netdev,
>>>>>>> traffic generation on the netdevs, etc.), I see cases where two processes trying
>>>>>>> to communicate across localhost with TCP seem to get a stuck network
>>>>>>> connection:
>>>>>>>
>>>>>>> [greearb@...dt7 ben_debug]$ grep 4004 netstat.txt |grep 127.0.0.1
>>>>>>> tcp        0 7988926 127.0.0.1:4004          127.0.0.1:23184         ESTABLISHED
>>>>>>> tcp        0  59805 127.0.0.1:23184         127.0.0.1:4004          ESTABLISHED
>>>>>>>
>>>>>>> Both processes in question continue to execute, and as far as I can tell, they are properly
>>>>>>> attempting to read/write the socket, but they are reading/writing 0 bytes (these sockets
>>>>>>> are non-blocking).  If one were stuck not reading, I would expect netstat
>>>>>>> to show bytes in the rcv buffer, but it is zero as you can see above.
>>>>>>>
>>>>>>> Kernel is 5.15.7 plus local hacks.  I can only reproduce this in a big, messy, complicated
>>>>>>> test case, with my local ath10k-ct and other patches that enable virtual wifi stations,
>>>>>>> but my code can grab logs at the time it sees the problem.  Is there anything
>>>>>>> more I can do to figure out why the TCP connection appears to be stuck?
>>>>>>
>>>>>> It could be very useful to get more information about the state of all
>>>>>> the stuck connections (sender and receiver side) with something like:
>>>>>>
>>>>>>      ss -tinmo 'sport = :4004 or dport = :4004'
>>>>>>
>>>>>> I would recommend downloading and building a recent version of the
>>>>>> 'ss' tool to maximize the information. Here is a recipe for doing
>>>>>> that:
>>>>>>
>>>>>>     https://github.com/google/bbr/blob/master/Documentation/bbr-faq.md#how-can-i-monitor-linux-tcp-bbr-connections
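(A periodic snapshot of that ss output is easy to script; the sketch below is untested and illustrative — the interval and output file name are arbitrary choices, not taken from the thread, and only the port number comes from the discussion.)

    #!/bin/bash
    # Append a timestamped snapshot of the port-4004 sockets every few
    # seconds so the socket state around a stall is preserved.
    PORT=4004
    OUT=ss_log.txt
    while true; do
        date >> "$OUT"
        ss -tinmo "dport = :$PORT or sport = :$PORT" >> "$OUT"
        sleep 5
    done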
>>>
>>> Hello Neal,
>>>
>>> Here is the ss output from when the problem was happening.  I think you can ignore the non-127.0.0.1
>>> connections, but I left them in just in case it is somehow helpful.
>>>
>>> In addition, the pcap capture file is uploaded here:
>>>
>>> http://www.candelatech.com/downloads/trace-lo-4004.pcap
>>>
>>> The problem was happening in this time frame:
>>>
>>> [root@...23c-0bdd ~]# date
>>> Thu 06 Jan 2022 10:14:49 AM PST
>>> [root@...23c-0bdd ~]# ss -tinmo 'dport = :4004 or sport = :4004'
>>> State       Recv-Q       Send-Q               Local Address:Port                Peer Address:Port
>>>
>>> ESTAB       0            222024                   127.0.0.1:57224                  127.0.0.1:4004         timer:(persist,1min23sec,9)
>>>           skmem:(r0,rb2000000,t0,tb2000000,f2232,w227144,o0,bl0,d0) ts sack reno wscale:10,4 rto:201 backoff:9 rtt:0.866/0.944 ato:40 mss:65483 pmtu:65535 
>>> rcvmss:65483
>>> advmss:65483 cwnd:10 bytes_sent:36810035 bytes_retrans:22025 bytes_acked:31729223 bytes_received:228063971 segs_out:20134 segs_in:17497 data_segs_out:11969
>>> data_segs_in:16642 send 6049237875bps lastsnd:3266 lastrcv:125252 lastack:125263 pacing_rate 12093239064bps delivery_rate 130966000000bps delivered:11863
>>> app_limited busy:275880ms rwnd_limited:21ms(0.0%) retrans:0/2 dsack_dups:2 rcv_rtt:0.671 rcv_space:1793073 rcv_ssthresh:934517 notsent:222024 minrtt:0.013
>>> ESTAB       0            0                   192.168.200.34:4004              192.168.200.34:16906
>>>           skmem:(r0,rb19521831,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:10,10 rto:201 rtt:0.483/0.64 ato:40 mss:22016 pmtu:65535 rcvmss:65483 
>>> advmss:65483
>>> cwnd:5 ssthresh:5 bytes_sent:8175956 bytes_retrans:460 bytes_acked:8174668 bytes_received:20820708 segs_out:3635 segs_in:2491 data_segs_out:2377
>>> data_segs_in:2330 send 1823271222bps lastsnd:125253 lastrcv:125250 lastack:125251 pacing_rate 2185097952bps delivery_rate 70451200000bps delivered:2372
>>> busy:14988ms rwnd_limited:1ms(0.0%) retrans:0/5 rcv_rtt:1.216 rcv_space:779351 rcv_ssthresh:9759798 minrtt:0.003
>>> ESTAB       0            139656              192.168.200.34:16908             192.168.200.34:4004         timer:(persist,1min52sec,2)
>>>           skmem:(r0,rb2000000,t0,tb2000000,f3960,w143496,o0,bl0,d0) ts sack reno wscale:10,10 rto:37397 backoff:2 rtt:4182.62/8303.35 ato:40 mss:65483 
>>> pmtu:65535
>>> rcvmss:22016 advmss:65483 cwnd:10 bytes_sent:22351275 bytes_retrans:397320 bytes_acked:20703982 bytes_received:7815946 segs_out:2585 segs_in:3642
>>> data_segs_out:2437 data_segs_in:2355 send 1252479bps lastsnd:7465 lastrcv:125250 lastack:125253 pacing_rate 2504952bps delivery_rate 15992bps delivered:2357
>>> busy:271236ms retrans:0/19 rcv_rtt:0.004 rcv_space:288293 rcv_ssthresh:43690 notsent:139656 minrtt:0.004
>>> ESTAB       0            460                 192.168.200.34:4004              192.168.200.34:16908        timer:(on,1min23sec,9)
>>>           skmem:(r0,rb9433368,t0,tb2626560,f2356,w1740,o0,bl0,d0) ts sack reno wscale:10,10 rto:102912 backoff:9 rtt:0.741/1.167 ato:40 mss:22016 pmtu:65535
>>> rcvmss:65483 advmss:65483 cwnd:1 ssthresh:2 bytes_sent:7850211 bytes_retrans:33437 bytes_acked:7815486 bytes_received:20703981 segs_out:3672 segs_in:2504
>>> data_segs_out:2380 data_segs_in:2356 send 237689609bps lastsnd:19753 lastrcv:158000 lastack:125250 pacing_rate 854817384bps delivery_rate 115645432bps
>>> delivered:2355 busy:200993ms unacked:1 retrans:0/24 lost:1 rcv_rtt:1.439 rcv_space:385874 rcv_ssthresh:4715943 minrtt:0.003
>>> ESTAB       0            147205              192.168.200.34:16906             192.168.200.34:4004         timer:(persist,1min46sec,9)
>>>           skmem:(r0,rb2000000,t0,tb2000000,f507,w151045,o0,bl0,d0) ts sack reno wscale:10,10 rto:223 backoff:9 rtt:11.4/18.962 ato:40 mss:65483 pmtu:65535 
>>> rcvmss:22016
>>> advmss:65483 cwnd:10 bytes_sent:23635760 bytes_retrans:220124 bytes_acked:20820709 bytes_received:8174668 segs_out:2570 segs_in:3625 data_segs_out:2409
>>> data_segs_in:2371 send 459529825bps lastsnd:7465 lastrcv:125253 lastack:125250 pacing_rate 918999184bps delivery_rate 43655333328bps delivered:2331 app_limited
>>> busy:185315ms retrans:0/14 rcv_rtt:0.005 rcv_space:220160 rcv_ssthresh:43690 notsent:147205 minrtt:0.003
>>> ESTAB       0            3928980                  127.0.0.1:4004                   127.0.0.1:57224        timer:(persist,7.639ms,8)
>>>           skmem:(r0,rb50000000,t0,tb3939840,f108,w4005780,o0,bl0,d3) ts sack reno wscale:4,10 rto:251 backoff:8 rtt:13.281/25.84 ato:40 mss:65483 pmtu:65535
>>> rcvmss:65483 advmss:65483 cwnd:10 ssthresh:10 bytes_sent:312422779 bytes_retrans:245567 bytes_acked:228063971 bytes_received:31729222 segs_out:18944
>>> segs_in:20021 data_segs_out:18090 data_segs_in:11862 send 394446201bps lastsnd:56617 lastrcv:125271 lastack:125252 pacing_rate 709983112bps delivery_rate
>>> 104772800000bps delivered:16643 app_limited busy:370468ms rwnd_limited:127ms(0.0%) retrans:0/26 rcv_rtt:7666.22 rcv_space:2279928 rcv_ssthresh:24999268
>>> notsent:3928980 minrtt:0.003
>>> [root@...23c-0bdd ~]# date
>>> Thu 06 Jan 2022 10:14:57 AM PST
>>> [root@...23c-0bdd ~]# ss -tinmo 'dport = :4004 or sport = :4004'
>>> State       Recv-Q       Send-Q               Local Address:Port                Peer Address:Port
>>>
>>> ESTAB       0            222208                   127.0.0.1:57224                  127.0.0.1:4004         timer:(persist,1min11sec,9)
>>>           skmem:(r0,rb2000000,t0,tb2000000,f2048,w227328,o0,bl0,d0) ts sack reno wscale:10,4 rto:201 backoff:9 rtt:0.866/0.944 ato:40 mss:65483 pmtu:65535 
>>> rcvmss:65483
>>> advmss:65483 cwnd:10 bytes_sent:36941001 bytes_retrans:22025 bytes_acked:31729223 bytes_received:228063971 segs_out:20136 segs_in:17497 data_segs_out:11971
>>> data_segs_in:16642 send 6049237875bps lastsnd:2663 lastrcv:136933 lastack:136944 pacing_rate 12093239064bps delivery_rate 130966000000bps delivered:11863
>>> app_limited busy:287561ms rwnd_limited:21ms(0.0%) retrans:0/2 dsack_dups:2 rcv_rtt:0.671 rcv_space:1793073 rcv_ssthresh:934517 notsent:222208 minrtt:0.013
>>> ESTAB       0            0                   192.168.200.34:4004              192.168.200.34:16906
>>>           skmem:(r0,rb19521831,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:10,10 rto:201 rtt:0.483/0.64 ato:40 mss:22016 pmtu:65535 rcvmss:65483 
>>> advmss:65483
>>> cwnd:5 ssthresh:5 bytes_sent:8175956 bytes_retrans:460 bytes_acked:8174668 bytes_received:20820708 segs_out:3635 segs_in:2491 data_segs_out:2377
>>> data_segs_in:2330 send 1823271222bps lastsnd:136934 lastrcv:136931 lastack:136932 pacing_rate 2185097952bps delivery_rate 70451200000bps delivered:2372
>>> busy:14988ms rwnd_limited:1ms(0.0%) retrans:0/5 rcv_rtt:1.216 rcv_space:779351 rcv_ssthresh:9759798 minrtt:0.003
>>> ESTAB       0            139656              192.168.200.34:16908             192.168.200.34:4004         timer:(persist,1min40sec,2)
>>>           skmem:(r0,rb2000000,t0,tb2000000,f3960,w143496,o0,bl0,d0) ts sack reno wscale:10,10 rto:37397 backoff:2 rtt:4182.62/8303.35 ato:40 mss:65483 
>>> pmtu:65535
>>> rcvmss:22016 advmss:65483 cwnd:10 bytes_sent:22351275 bytes_retrans:397320 bytes_acked:20703982 bytes_received:7815946 segs_out:2585 segs_in:3642
>>> data_segs_out:2437 data_segs_in:2355 send 1252479bps lastsnd:19146 lastrcv:136931 lastack:136934 pacing_rate 2504952bps delivery_rate 15992bps delivered:2357
>>> busy:282917ms retrans:0/19 rcv_rtt:0.004 rcv_space:288293 rcv_ssthresh:43690 notsent:139656 minrtt:0.004
>>> ESTAB       0            460                 192.168.200.34:4004              192.168.200.34:16908        timer:(on,1min11sec,9)
>>>           skmem:(r0,rb9433368,t0,tb2626560,f2356,w1740,o0,bl0,d0) ts sack reno wscale:10,10 rto:102912 backoff:9 rtt:0.741/1.167 ato:40 mss:22016 pmtu:65535
>>> rcvmss:65483 advmss:65483 cwnd:1 ssthresh:2 bytes_sent:7850211 bytes_retrans:33437 bytes_acked:7815486 bytes_received:20703981 segs_out:3672 segs_in:2504
>>> data_segs_out:2380 data_segs_in:2356 send 237689609bps lastsnd:31434 lastrcv:169681 lastack:136931 pacing_rate 854817384bps delivery_rate 115645432bps
>>> delivered:2355 busy:212674ms unacked:1 retrans:0/24 lost:1 rcv_rtt:1.439 rcv_space:385874 rcv_ssthresh:4715943 minrtt:0.003
>>> ESTAB       0            147205              192.168.200.34:16906             192.168.200.34:4004         timer:(persist,1min35sec,9)
>>>           skmem:(r0,rb2000000,t0,tb2000000,f507,w151045,o0,bl0,d0) ts sack reno wscale:10,10 rto:223 backoff:9 rtt:11.4/18.962 ato:40 mss:65483 pmtu:65535 
>>> rcvmss:22016
>>> advmss:65483 cwnd:10 bytes_sent:23635760 bytes_retrans:220124 bytes_acked:20820709 bytes_received:8174668 segs_out:2570 segs_in:3625 data_segs_out:2409
>>> data_segs_in:2371 send 459529825bps lastsnd:19146 lastrcv:136934 lastack:136931 pacing_rate 918999184bps delivery_rate 43655333328bps delivered:2331 app_limited
>>> busy:196996ms retrans:0/14 rcv_rtt:0.005 rcv_space:220160 rcv_ssthresh:43690 notsent:147205 minrtt:0.003
>>> ESTAB       0            3928980                  127.0.0.1:4004                   127.0.0.1:57224        timer:(persist,1min57sec,9)
>>>           skmem:(r0,rb50000000,t0,tb3939840,f108,w4005780,o0,bl0,d3) ts sack reno wscale:4,10 rto:251 backoff:9 rtt:13.281/25.84 ato:40 mss:65483 pmtu:65535
>>> rcvmss:65483 advmss:65483 cwnd:10 ssthresh:10 bytes_sent:312488262 bytes_retrans:245567 bytes_acked:228063971 bytes_received:31729222 segs_out:18945
>>> segs_in:20021 data_segs_out:18091 data_segs_in:11862 send 394446201bps lastsnd:2762 lastrcv:136952 lastack:136933 pacing_rate 709983112bps delivery_rate
>>> 104772800000bps delivered:16643 app_limited busy:382149ms rwnd_limited:127ms(0.0%) retrans:0/26 rcv_rtt:7666.22 rcv_space:2279928 rcv_ssthresh:24999268
>>> notsent:3928980 minrtt:0.003
>>> [root@...23c-0bdd ~]#
>>>
>>>
>>> We can reproduce this readily at the moment, and I'm happy to try patches and/or do more debugging.  We also tried with a 5.12 kernel,
>>> and saw the same problems, but in all cases we have local patches applied; there is no way for us to run this test without
>>> at least a fair number of local patches.
>>
>> Thanks for the ss traces and tcpdump output! The tcpdump traces are
>> nice, in that they start before the connection starts, so they capture the
>> SYN and its critical options, like wscale.
>>
>> From the "timer:(persist" in the ss output, it seems the stalls (that
>> are preventing the send buffers from being transmitted) are caused by
>> a 0-byte receive window causing the senders to stop sending, and
>> periodically fire the ICSK_TIME_PROBE0 timer to check for an open
>> receive window. From "backoff:9" it seems this condition has lasted
>> for a very long exponential backoff process.
>>
>> I don't see 0-byte receive window problems in the trace, but this is
>> probably because the tcpdump traces only last through 10:12:47 PST,
>> and the problem is showing up in ss at 10:14:49 AM PST and later.
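(One untested way to check a capture like this for zero-window advertisements: the TCP window field occupies bytes 14-15 of the TCP header, so a pcap filter can match segments advertising a zero receive window. The capture file name is the one posted earlier in the thread.)

    # Scan the saved capture for zero-window segments on port 4004.
    tcpdump -n -r trace-lo-4004.pcap 'tcp port 4004 and tcp[14:2] = 0'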
>>
>> Is it possible to reproduce the problem again, and this time let the
>> tcpdump traces run all the way through the period where the
>> connections freeze and you grab the "ss" output?
>>
>> You may also have to explicitly kill the tcpdump. Perhaps the tail of
>> the trace was buffered in tcpdump's output buffer and not flushed to
>> disk. A "killall tcpdump" should do the trick to force it to cleanly
>> flush everything.
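(A sketch of a capture invocation that sidesteps the buffering issue: tcpdump's -U flag writes each packet to the save file as it is captured, so little or nothing is lost when the process is stopped. The file names are illustrative.)

    # Packet-buffered capture on loopback, full snaplen.
    tcpdump -i lo -U -s 0 -w trace-lo-4004.pcap 'tcp port 4004' &
    # ... reproduce the stall and collect the ss output ...
    killall tcpdump   # SIGTERM lets tcpdump flush and close the file cleanly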
> 
> Here is another set of debugging, I made sure tcpdump ran the entire time,
> as well as the ss monitoring script.
> 
> http://www.candelatech.com/downloads/ss_log.txt
> http://www.candelatech.com/downloads/trace-lo-4004-b.pcap
> 
> In addition, here are logs from my tool with msec timestamps.  It detects
> communication failure and logs it.  Interestingly, I think it recovered
> after one long timeout, but in the end, it went past the 2-minute cutoff mark
> where my program will close the TCP connection and restart things.
> 
> 1641506767983:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 34458ms, sending req for update, read-isset: 0
> 1641506773839:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 40314ms, sending req for update, read-isset: 0
> 1641506780563:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 47038ms, sending req for update, read-isset: 0
> 1641506786567:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 53041ms, sending req for update, read-isset: 0
> 1641506823537:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 34949ms, sending req for update, read-isset: 0
> 1641506829280:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 40692ms, sending req for update, read-isset: 0
> 1641506834878:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 46289ms, sending req for update, read-isset: 0
> 1641506840778:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 52189ms, sending req for update, read-isset: 0
> 1641506846786:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 58198ms, sending req for update, read-isset: 0
> 1641506852746:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 64158ms, sending req for update, read-isset: 0
> 1641506858280:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 69692ms, sending req for update, read-isset: 0
> 1641506864200:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 75612ms, sending req for update, read-isset: 0
> 1641506870556:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 81968ms, sending req for update, read-isset: 0
> 1641506876564:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 87976ms, sending req for update, read-isset: 0
> 1641506882774:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 94185ms, sending req for update, read-isset: 0
> 
> # Recovered between here and above it seems.
> 
> 1641507005029:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 35840ms, sending req for update, read-isset: 0
> 1641507035759:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 30164ms, sending req for update, read-isset: 0
> 1641507042161:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 36565ms, sending req for update, read-isset: 0
> 1641507048397:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 42802ms, sending req for update, read-isset: 0
> 1641507054491:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 48896ms, sending req for update, read-isset: 0
> 1641507060748:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 55153ms, sending req for update, read-isset: 0
> 1641507067083:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 61488ms, sending req for update, read-isset: 0
> 1641507073438:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 67842ms, sending req for update, read-isset: 0
> 1641507079638:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 74042ms, sending req for update, read-isset: 0
> 1641507085926:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 80330ms, sending req for update, read-isset: 0
> 1641507091788:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 86192ms, sending req for update, read-isset: 0
> 1641507098042:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 92447ms, sending req for update, read-isset: 0
> 1641507104283:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 98687ms, sending req for update, read-isset: 0
> 1641507110466:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 104871ms, sending req for update, read-isset: 0
> 1641507116381:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 110786ms, sending req for update, read-isset: 0
> 1641507123034:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 117439ms, sending req for update, read-isset: 0
> 1641507128975:  Card.cc 801: WARNING:  Card: Shelf: 1, Card: 1 has not received communication in: 123379ms, sending req for update, read-isset: 0

Hello Neal,

Do the new captures help any?

From my own look at things, it seems that the sniffer fails to get frames around the time the problem
starts happening.  I am baffled as to how that can happen, especially since it seems to stop getting
packets from multiple different TCP connections (the sniffer filter would pick up some other loopback-related
connections to the same IP port).

And, if I interpret the ss output properly, after the problem happens, the sockets still think they are
sending data.  I didn't check closely enough to see if the peer side thought it received it.

We are going to try to reproduce this without wifi, but I'm not sure we'll have any luck with that.
We did test without VRF (using lots of ip rules instead), and a similar problem was seen according to my
test team (I did not debug it in detail).

Do you have any suggestions for how to debug this further?  I am happy to hack stuff into the
kernel if you have some suggested places to add debugging...
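(One low-touch option, sketched here under the assumption that the standard tcp:* tracepoints are enabled in this 5.15 build and that tracefs is mounted at the usual path, is to watch the kernel's existing TCP tracepoints before adding custom debug patches; the grep is just a rough port filter.)

    # Untested sketch: enable TCP tracepoints and watch the live trace.
    cd /sys/kernel/debug/tracing
    echo 1 > events/tcp/tcp_probe/enable           # per-ACK socket state
    echo 1 > events/tcp/tcp_retransmit_skb/enable  # retransmissions
    cat trace_pipe | grep 4004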

Thanks,
Ben
