netdev - Re: Debugging stuck tcp connection across localhost [snip]

Message-ID: <a31557d8-13da-07e2-7a64-ce07e786f25c@candelatech.com>
Date:   Wed, 12 Jan 2022 10:44:37 -0800
From:   Ben Greear <greearb@...delatech.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        netdev <netdev@...r.kernel.org>
Subject: Re: Debugging stuck tcp connection across localhost [snip]

On 1/12/22 10:01 AM, Ben Greear wrote:
> On 1/12/22 9:12 AM, Eric Dumazet wrote:
>> On Wed, Jan 12, 2022 at 6:52 AM Ben Greear <greearb@...delatech.com> wrote:
>>>
>>> On 1/11/22 11:41 PM, Eric Dumazet wrote:
>>>> On Tue, Jan 11, 2022 at 1:35 PM Ben Greear <greearb@...delatech.com> wrote:
>>>>>
>>>>> On 1/11/22 2:46 AM, Eric Dumazet wrote:
>>>>>>
> 
>>>> Just to clarify:
>>>>
>>>> Have you any qdisc on lo interface ?
>>>>
>>>> Can you try:
>>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>>> index 5079832af5c1090917a8fd5dfb1a3025e2d85ae0..81a26ce4d79fd48f870b5c1d076a9082950e2a57 100644
>>>> --- a/net/ipv4/tcp_output.c
>>>> +++ b/net/ipv4/tcp_output.c
>>>> @@ -2769,6 +2769,7 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
>>>>    static bool skb_still_in_host_queue(struct sock *sk,
>>>>                                       const struct sk_buff *skb)
>>>>    {
>>>> +#if 0
>>>>           if (unlikely(skb_fclone_busy(sk, skb))) {
>>>>                   set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
>>>>                   smp_mb__after_atomic();
>>>> @@ -2778,6 +2779,7 @@ static bool skb_still_in_host_queue(struct sock *sk,
>>>>                           return true;
>>>>                   }
>>>>           }
>>>> +#endif
>>>>           return false;
>>>>    }
>>>>
>>>
>>> I will try that today.
>>>
>>> I don't think I have qdisc on lo:
>>>
>>> # tc qdisc show|grep 'dev lo'
>>> qdisc noqueue 0: dev lo root refcnt 2
>>
>> Great, I wanted to make sure you were not hitting some bug there
>> (pfifo_fast has been buggy for many kernel versions)
>>
>>>
>>> The eth ports are using fq_codel, and I guess they are using mq as well.
>>>
>>> We moved one of the processes off of the problematic machine so that it communicates over
>>> Ethernet instead of 'lo', and problem seems to have gone away.  But, that also
>>> changes system load, so it could be coincidence.
>>>
>>> Also, conntrack -L showed nothing on a machine with simpler config where the two problematic processes
>>> are talking over 'lo'.  The machine that shows problem does have a lot of conntrack entries because it
>>> is also doing some NAT for other data connections, but I don't think this should affect the 127.0.0.1 traffic.
>>> There is a decent chance I misunderstand your comment about conntrack, though...
>>
>> This was a wild guess. Honestly, I do not have a smoking gun yet.
> 
> I tried your patch above, it did not help.
> 
> Also, looks like maybe we reproduced same issue with processes on different
> machines, but I was not able to verify it was the same root cause, and at
> least, it was harder to reproduce.
> 
> I'm back to testing in the easily reproducible case now.
> 
> I have a few local patches in the general networking path, I'm going to
> attempt to back those out just in case my patches are buggy.

Well, I think maybe I found the problem.

I looked in the right place at the right time and saw that the kernel was
complaining that the neigh table was full.  The default thresholds are too
small for the number of interfaces we are using, and the script that was
supposed to raise them had a typo, so the values were never actually set.

When the neigh entries are fully consumed, even communication across 127.0.0.1
fails in somewhat mysterious ways, and I guess this can break existing
connections too, not just new ones.

We'll do some more testing with the threshold fix in place.  There is always a
chance that there is more than one problem in this area.
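
For anyone hitting the same thing, a read-back check along these lines would
have caught our typo.  This is only a sketch, not our actual script: it assumes
the standard procfs location for the IPv4 neigh defaults
(/proc/sys/net/ipv4/neigh/default/gc_thresh1..3), and the function names and the
wanted values in the usage note are made up for illustration.

```python
from pathlib import Path

# Assumption: standard Linux procfs layout for the IPv4 neighbour table defaults.
NEIGH_DEFAULT = Path("/proc/sys/net/ipv4/neigh/default")
THRESH_NAMES = ("gc_thresh1", "gc_thresh2", "gc_thresh3")


def read_thresholds(base=NEIGH_DEFAULT):
    """Return {name: value} for the three gc_thresh files found under base."""
    base = Path(base)
    return {name: int((base / name).read_text().strip()) for name in THRESH_NAMES}


def check_thresholds(wanted, base=NEIGH_DEFAULT):
    """Compare the values the kernel reports with the values we meant to set.

    Returns a list of human-readable problems; an empty list means all three
    thresholds match.  A misspelled key in `wanted` shows up twice: once as an
    unknown key, and once as the real threshold that was never configured.
    """
    actual = read_thresholds(base)
    problems = []
    for name in THRESH_NAMES:
        if name not in wanted:
            problems.append(f"{name}: no wanted value given, kernel has {actual[name]}")
        elif actual[name] != wanted[name]:
            problems.append(f"{name}: wanted {wanted[name]}, kernel has {actual[name]}")
    for name in sorted(set(wanted) - set(THRESH_NAMES)):
        problems.append(f"{name}: unknown key (typo?)")
    return problems
```

Calling check_thresholds({"gc_thresh1": 4096, "gc_thresh2": 8192,
"gc_thresh3": 16384}) after the setup script has run (hypothetical values)
returns an empty list only if every value really landed in the kernel.  The
IPv6 table has its own copies under net/ipv6/neigh/default, which this sketch
does not look at.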

Thanks,
Ben