[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1481901684.24490.7@smtp.office365.com>
Date: Fri, 16 Dec 2016 10:21:24 -0500
From: Josef Bacik <jbacik@...com>
To: Hannes Frederic Sowa <hannes@...essinduktion.org>
CC: Tom Herbert <tom@...bertland.com>,
Craig Gallek <kraigatgoog@...il.com>,
Eric Dumazet <eric.dumazet@...il.com>,
Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6
On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@...com> wrote:
> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
> <hannes@...essinduktion.org> wrote:
>> Hi Josef,
>>
>> On 15.12.2016 19:53, Josef Bacik wrote:
>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@...bertland.com>
>>> wrote:
>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
>>>> <kraigatgoog@...il.com>
>>>> wrote:
>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
>>>>> <tom@...bertland.com>
>>>>> wrote:
>>>>>> I think there may be some suspicious code in
>>>>>> inet_csk_get_port. At
>>>>>> tb_found there is:
>>>>>>
>>>>>> if (((tb->fastreuse > 0 && reuse) ||
>>>>>> (tb->fastreuseport > 0 &&
>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb)
>>>>>> &&
>>>>>> sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>> uid))) &&
>>>>>> smallest_size == -1)
>>>>>> goto success;
>>>>>> if
>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>> tb, true)) {
>>>>>> if ((reuse ||
>>>>>> (tb->fastreuseport > 0 &&
>>>>>> sk->sk_reuseport &&
>>>>>>
>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>> uid_eq(tb->fastuid, uid))) &&
>>>>>> smallest_size != -1 && --attempts
>>>>>> >= 0) {
>>>>>> spin_unlock_bh(&head->lock);
>>>>>> goto again;
>>>>>> }
>>>>>> goto fail_unlock;
>>>>>> }
>>>>>>
>>>>>> AFAICT there is redundancy in these two conditionals. The
>>>>>> same clause
>>>>>> is being checked in both: (tb->fastreuseport > 0 &&
>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is
>>>>>> true the
>>>>>> first conditional should be hit, goto done, and the second
>>>>>> will never
>>>>>> evaluate that part to true-- unless the sk is changed (do we
>>>>>> need
>>>>>> READ_ONCE for sk->sk_reuseport_cb?).
>>>>> That's an interesting point... It looks like this function also
>>>>> changed in 4.6 from using a single local_bh_disable() at the
>>>>> beginning
>>>>> with several spin_lock(&head->lock) to exclusively
>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps the
>>>>> full bh
>>>>> disable variant was preventing the timers in your stack trace
>>>>> from
>>>>> running interleaved with this function before?
>>>>
>>>> Could be, although dropping the lock shouldn't be able to affect
>>>> the
>>>> search state. TBH, I'm a little lost in reading function, the
>>>> SO_REUSEPORT handling is pretty complicated. For instance,
>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
>>>> that
>>>> function and also in every call to inet_csk_bind_conflict. I
>>>> wonder if
>>>> we can simply this under the assumption that SO_REUSEPORT is only
>>>> allowed if the port number (snum) is explicitly specified.
>>>
>>> Ok first I have data for you Hannes, here's the time distributions
>>> before during and after the lockup (with all the debugging in
>>> place the
>>> box eventually recovers). I've attached it as a text file since
>>> it is
>>> long.
>>
>> Thanks a lot!
>>
>>> Second is I was thinking about why we would spend so much time
>>> doing the
>>> ->owners list, and obviously it's because of the massive amount of
>>> timewait sockets on the owners list. I wrote the following dumb
>>> patch
>>> and tested it and the problem has disappeared completely. Now I
>>> don't
>>> know if this is right at all, but I thought it was weird we weren't
>>> copying the soreuseport option from the original socket onto the
>>> twsk.
>>> Is there are reason we aren't doing this currently? Does this help
>>> explain what is happening? Thanks,
>>
>> The patch is interesting and a good clue, but I am immediately a bit
>> concerned that we don't copy/tag the socket with the uid also to keep
>> the security properties for SO_REUSEPORT. I have to think a bit more
>> about this.
>>
>> We have seen hangs during connect. I am afraid this patch wouldn't
>> help
>> there while also guaranteeing uniqueness.
>
>
> Yeah so I looked at the code some more and actually my patch is
> really bad. If sk2->sk_reuseport is set we'll look at
> sk2->sk_reuseport_cb, which is outside of the timewait sock, so
> that's definitely bad.
>
> But we should at least be setting it to 0 so that we don't do this
> normally. Unfortunately simply setting it to 0 doesn't fix the
> problem. So for some reason having ->sk_reuseport set to 1 on a
> timewait socket makes this problem non-existent, which is strange.
>
> So back to the drawing board I guess. I wonder if doing what craig
> suggested and batching the timewait timer expires so it hurts less
> would accomplish the same results. Thanks,
Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This is
the code
if ((!reuse || !sk2->sk_reuse ||
sk2->sk_state == TCP_LISTEN) &&
(!reuseport || !sk2->sk_reuseport ||
rcu_access_pointer(sk->sk_reuseport_cb) ||
(sk2->sk_state != TCP_TIME_WAIT &&
!uid_eq(uid, sock_i_uid(sk2))))) {
if (!sk2->sk_rcv_saddr ||
!sk->sk_rcv_saddr ||
sk2->sk_rcv_saddr ==
sk->sk_rcv_saddr)
break;
}
so in my patches case we now have reuseport == 1, sk2->sk_reuseport ==
1. But now we are using reuseport, so sk->sk_reuseport_cb should be
non-NULL right? So really setting the timewait sock's sk_reuseport
should have no bearing on how this loop plays out right? Thanks,
Josef
Powered by blists - more mailing lists