[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1481928610.17731.0@smtp.office365.com>
Date: Fri, 16 Dec 2016 17:50:10 -0500
From: Josef Bacik <jbacik@...com>
To: Tom Herbert <tom@...bertland.com>
CC: Hannes Frederic Sowa <hannes@...essinduktion.org>,
Craig Gallek <kraigatgoog@...il.com>,
Eric Dumazet <eric.dumazet@...il.com>,
"Linux Kernel Network Developers" <netdev@...r.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6
On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@...bertland.com>
wrote:
> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@...com> wrote:
>> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@...com> wrote:
>>>
>>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@...com> wrote:
>>>>
>>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>> <hannes@...essinduktion.org> wrote:
>>>>>
>>>>> Hi Josef,
>>>>>
>>>>> On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>>
>>>>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert
>>>>>> <tom@...bertland.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
>>>>>>> <kraigatgoog@...il.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
>>>>>>>> <tom@...bertland.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I think there may be some suspicious code in
>>>>>>>>> inet_csk_get_port. At
>>>>>>>>> tb_found there is:
>>>>>>>>>
>>>>>>>>> if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>> (tb->fastreuseport > 0 &&
>>>>>>>>>
>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>> uid_eq(tb->fastuid,
>>>>>>>>> uid))) &&
>>>>>>>>> smallest_size == -1)
>>>>>>>>> goto success;
>>>>>>>>> if
>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>> tb, true)) {
>>>>>>>>> if ((reuse ||
>>>>>>>>> (tb->fastreuseport > 0 &&
>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>
>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>> uid_eq(tb->fastuid, uid))) &&
>>>>>>>>> smallest_size != -1 &&
>>>>>>>>> --attempts >=
>>>>>>>>> 0) {
>>>>>>>>>
>>>>>>>>> spin_unlock_bh(&head->lock);
>>>>>>>>> goto again;
>>>>>>>>> }
>>>>>>>>> goto fail_unlock;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> AFAICT there is redundancy in these two conditionals. The
>>>>>>>>> same
>>>>>>>>> clause
>>>>>>>>> is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>>>>>>>>> is true
>>>>>>>>> the
>>>>>>>>> first conditional should be hit, goto done, and the
>>>>>>>>> second will
>>>>>>>>> never
>>>>>>>>> evaluate that part to true-- unless the sk is changed (do
>>>>>>>>> we need
>>>>>>>>> READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>>
>>>>>>>> That's an interesting point... It looks like this function
>>>>>>>> also
>>>>>>>> changed in 4.6 from using a single local_bh_disable() at the
>>>>>>>> beginning
>>>>>>>> with several spin_lock(&head->lock) to exclusively
>>>>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps
>>>>>>>> the full
>>>>>>>> bh
>>>>>>>> disable variant was preventing the timers in your stack
>>>>>>>> trace from
>>>>>>>> running interleaved with this function before?
>>>>>>>
>>>>>>>
>>>>>>> Could be, although dropping the lock shouldn't be able to
>>>>>>> affect the
>>>>>>> search state. TBH, I'm a little lost in reading function, the
>>>>>>> SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three
>>>>>>> times in
>>>>>>> that
>>>>>>> function and also in every call to inet_csk_bind_conflict. I
>>>>>>> wonder
>>>>>>> if
>>>>>>> we can simply this under the assumption that SO_REUSEPORT is
>>>>>>> only
>>>>>>> allowed if the port number (snum) is explicitly specified.
>>>>>>
>>>>>>
>>>>>> Ok first I have data for you Hannes, here's the time
>>>>>> distributions
>>>>>> before during and after the lockup (with all the debugging in
>>>>>> place
>>>>>> the
>>>>>> box eventually recovers). I've attached it as a text file
>>>>>> since it is
>>>>>> long.
>>>>>
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>>> Second is I was thinking about why we would spend so much time
>>>>>> doing
>>>>>> the
>>>>>> ->owners list, and obviously it's because of the massive
>>>>>> amount of
>>>>>> timewait sockets on the owners list. I wrote the following
>>>>>> dumb patch
>>>>>> and tested it and the problem has disappeared completely. Now
>>>>>> I don't
>>>>>> know if this is right at all, but I thought it was weird we
>>>>>> weren't
>>>>>> copying the soreuseport option from the original socket onto
>>>>>> the twsk.
>>>>>> Is there are reason we aren't doing this currently? Does this
>>>>>> help
>>>>>> explain what is happening? Thanks,
>>>>>
>>>>>
>>>>> The patch is interesting and a good clue, but I am immediately a
>>>>> bit
>>>>> concerned that we don't copy/tag the socket with the uid also to
>>>>> keep
>>>>> the security properties for SO_REUSEPORT. I have to think a bit
>>>>> more
>>>>> about this.
>>>>>
>>>>> We have seen hangs during connect. I am afraid this patch
>>>>> wouldn't help
>>>>> there while also guaranteeing uniqueness.
>>>>
>>>>
>>>>
>>>> Yeah so I looked at the code some more and actually my patch is
>>>> really
>>>> bad. If sk2->sk_reuseport is set we'll look at
>>>> sk2->sk_reuseport_cb, which
>>>> is outside of the timewait sock, so that's definitely bad.
>>>>
>>>> But we should at least be setting it to 0 so that we don't do this
>>>> normally. Unfortunately simply setting it to 0 doesn't fix the
>>>> problem. So
>>>> for some reason having ->sk_reuseport set to 1 on a timewait
>>>> socket makes
>>>> this problem non-existent, which is strange.
>>>>
>>>> So back to the drawing board I guess. I wonder if doing what
>>>> craig
>>>> suggested and batching the timewait timer expires so it hurts
>>>> less would
>>>> accomplish the same results. Thanks,
>>>
>>>
>>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.
>>> This is the
>>> code
>>>
>>> if ((!reuse || !sk2->sk_reuse ||
>>> sk2->sk_state == TCP_LISTEN) &&
>>> (!reuseport || !sk2->sk_reuseport ||
>>>
>>> rcu_access_pointer(sk->sk_reuseport_cb) ||
>>> (sk2->sk_state != TCP_TIME_WAIT &&
>>> !uid_eq(uid, sock_i_uid(sk2))))) {
>>>
>>> if (!sk2->sk_rcv_saddr ||
>>> !sk->sk_rcv_saddr
>>> ||
>>> sk2->sk_rcv_saddr ==
>>> sk->sk_rcv_saddr)
>>> break;
>>> }
>>>
>>> so in my patches case we now have reuseport == 1,
>>> sk2->sk_reuseport == 1.
>>> But now we are using reuseport, so sk->sk_reuseport_cb should be
>>> non-NULL
>>> right? So really setting the timewait sock's sk_reuseport should
>>> have no
>>> bearing on how this loop plays out right? Thanks,
>>
>>
>>
>> So more messing around and I noticed that we basically don't do the
>> tb->fastreuseport logic at all if we've ended up with a non
>> SO_REUSEPORT
>> socket on that tb. So before I fully understood what I was doing I
>> fixed it
>> so that after we go through ->bind_conflict() once with a
>> SO_REUSEPORT
>> socket, we reset tb->fastreuseport to 1 and set the uid to match
>> the uid of
>> the socket. This made the problem go away. Tom pointed out that
>> if we bind
>> to the same port on a different address and we have a non
>> SO_REUSEPORT
>> socket with the same address on this tb then we'd be screwed with
>> my code.
>>
>> Which brings me to his proposed solution. We need another hash
>> table that
>> is indexed based on the binding address. Then each node
>> corresponds to one
>> address/port binding, with non-SO_REUSEPORT entries having only one
>> entry,
>> and normal SO_REUSEPORT entries having many. This cleans up the
>> need to
>> search all the possible sockets on any given tb, we just go and
>> look at the
>> one we care about. Does this make sense? Thanks,
>>
> Hi Josef,
>
> Thinking about it some more the hash table won't work because of the
> rules of binding different addresses to the same port. What I think we
> can do is to change inet_bind_bucket to be structure that contains all
> the information used to detect conflicts (reuse*, if, address, uid,
> etc.) and a list of sockets that share that exact same information--
> for instance all socket in timewait state create through some listener
> socket should wind up on single bucket. When we do the bind_conflict
> function we only should have to walk this buckets, not the full list
> of sockets.
>
> Thoughts on this?
This sounds good, maybe tb->owners be a list of say
struct inet_unique_shit {
struct sock_common sk;
struct hlist socks;
};
Then we make inet_unique_shit like twsks', just copy the relevant
information, then hang the real sockets off of the socks hlist.
Something like that? Thanks,
Josef
Powered by blists - more mailing lists