[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALx6S34dmOWiEtOFQMc1FGhqvLqxM5JkD6q5MvaZz0cGZmNM=A@mail.gmail.com>
Date: Thu, 15 Dec 2016 14:39:26 -0800
From: Tom Herbert <tom@...bertland.com>
To: Josef Bacik <jbacik@...com>
Cc: Craig Gallek <kraigatgoog@...il.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Eric Dumazet <eric.dumazet@...il.com>,
Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6
On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik <jbacik@...com> wrote:
> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@...bertland.com> wrote:
>>
>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatgoog@...il.com>
>> wrote:
>>>
>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <tom@...bertland.com>
>>> wrote:
>>>>
>>>> I think there may be some suspicious code in inet_csk_get_port. At
>>>> tb_found there is:
>>>>
>>>> if (((tb->fastreuse > 0 && reuse) ||
>>>> (tb->fastreuseport > 0 &&
>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>> sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>>>> smallest_size == -1)
>>>> goto success;
>>>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb,
>>>> true)) {
>>>> if ((reuse ||
>>>> (tb->fastreuseport > 0 &&
>>>> sk->sk_reuseport &&
>>>> !rcu_access_pointer(sk->sk_reuseport_cb)
>>>> &&
>>>> uid_eq(tb->fastuid, uid))) &&
>>>> smallest_size != -1 && --attempts >= 0) {
>>>> spin_unlock_bh(&head->lock);
>>>> goto again;
>>>> }
>>>> goto fail_unlock;
>>>> }
>>>>
>>>> AFAICT there is redundancy in these two conditionals. The same clause
>>>> is being checked in both: (tb->fastreuseport > 0 &&
>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
>>>> first conditional should be hit, goto done, and the second will never
>>>> evaluate that part to true-- unless the sk is changed (do we need
>>>> READ_ONCE for sk->sk_reuseport_cb?).
>>>
>>> That's an interesting point... It looks like this function also
>>> changed in 4.6 from using a single local_bh_disable() at the beginning
>>> with several spin_lock(&head->lock) to exclusively
>>> spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh
>>> disable variant was preventing the timers in your stack trace from
>>> running interleaved with this function before?
>>
>>
>> Could be, although dropping the lock shouldn't be able to affect the
>> search state. TBH, I'm a little lost in reading function, the
>> SO_REUSEPORT handling is pretty complicated. For instance,
>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
>> function and also in every call to inet_csk_bind_conflict. I wonder if
>> we can simply this under the assumption that SO_REUSEPORT is only
>> allowed if the port number (snum) is explicitly specified.
>
>
> Ok first I have data for you Hannes, here's the time distributions before
> during and after the lockup (with all the debugging in place the box
> eventually recovers). I've attached it as a text file since it is long.
>
> Second is I was thinking about why we would spend so much time doing the
> ->owners list, and obviously it's because of the massive amount of timewait
> sockets on the owners list. I wrote the following dumb patch and tested it
> and the problem has disappeared completely. Now I don't know if this is
> right at all, but I thought it was weird we weren't copying the soreuseport
> option from the original socket onto the twsk. Is there are reason we
> aren't doing this currently? Does this help explain what is happening?
> Thanks,
>
I think that would explain it. We would be walking long lists of TW
sockets in inet_bind_bucket_for_each(tb, &head->chain). This should
break, although now I'm wondering if there's other ways we can get
into this situation. reuseport ensures that we can have long lists of
sockets in a single bucket, TW sockets can make that list really long.
Tom
> Josef
Powered by blists - more mailing lists