[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e32b5375-38fb-1fd8-d97a-bf740e368205@stressinduktion.org>
Date: Sat, 17 Dec 2016 12:08:52 +0100
From: Hannes Frederic Sowa <hannes@...essinduktion.org>
To: Josef Bacik <jbacik@...com>, Tom Herbert <tom@...bertland.com>
Cc: Craig Gallek <kraigatgoog@...il.com>,
Eric Dumazet <eric.dumazet@...il.com>,
Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6
On 16.12.2016 23:50, Josef Bacik wrote:
> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@...bertland.com> wrote:
>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@...com> wrote:
>>> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@...com> wrote:
>>>>
>>>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@...com> wrote:
>>>>>
>>>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>>> <hannes@...essinduktion.org> wrote:
>>>>>>
>>>>>> Hi Josef,
>>>>>>
>>>>>> On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>>>
>>>>>>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@...bertland.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
>>>>>>>> <kraigatgoog@...il.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
>>>>>>>>> <tom@...bertland.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I think there may be some suspicious code in
>>>>>>>>>> inet_csk_get_port. At
>>>>>>>>>> tb_found there is:
>>>>>>>>>>
>>>>>>>>>> if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>>> (tb->fastreuseport > 0 &&
>>>>>>>>>>
>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>> sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>>>>>> uid))) &&
>>>>>>>>>> smallest_size == -1)
>>>>>>>>>> goto success;
>>>>>>>>>> if
>>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>>> tb, true)) {
>>>>>>>>>> if ((reuse ||
>>>>>>>>>> (tb->fastreuseport > 0 &&
>>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>>
>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>> uid_eq(tb->fastuid, uid))) &&
>>>>>>>>>> smallest_size != -1 &&
>>>>>>>>>> --attempts >=
>>>>>>>>>> 0) {
>>>>>>>>>> spin_unlock_bh(&head->lock);
>>>>>>>>>> goto again;
>>>>>>>>>> }
>>>>>>>>>> goto fail_unlock;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> AFAICT there is redundancy in these two conditionals. The
>>>>>>>>>> same
>>>>>>>>>> clause
>>>>>>>>>> is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>>>>>>>>>> is true
>>>>>>>>>> the
>>>>>>>>>> first conditional should be hit, goto done, and the second
>>>>>>>>>> will
>>>>>>>>>> never
>>>>>>>>>> evaluate that part to true-- unless the sk is changed (do
>>>>>>>>>> we need
>>>>>>>>>> READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>>>
>>>>>>>>> That's an interesting point... It looks like this function also
>>>>>>>>> changed in 4.6 from using a single local_bh_disable() at the
>>>>>>>>> beginning
>>>>>>>>> with several spin_lock(&head->lock) to exclusively
>>>>>>>>> spin_lock_bh(&head->lock) at each locking point. Perhaps
>>>>>>>>> the full
>>>>>>>>> bh
>>>>>>>>> disable variant was preventing the timers in your stack
>>>>>>>>> trace from
>>>>>>>>> running interleaved with this function before?
>>>>>>>>
>>>>>>>>
>>>>>>>> Could be, although dropping the lock shouldn't be able to
>>>>>>>> affect the
>>>>>>>> search state. TBH, I'm a little lost in reading function, the
>>>>>>>> SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
>>>>>>>> that
>>>>>>>> function and also in every call to inet_csk_bind_conflict. I
>>>>>>>> wonder
>>>>>>>> if
>>>>>>>> we can simply this under the assumption that SO_REUSEPORT is only
>>>>>>>> allowed if the port number (snum) is explicitly specified.
>>>>>>>
>>>>>>>
>>>>>>> Ok first I have data for you Hannes, here's the time distributions
>>>>>>> before during and after the lockup (with all the debugging in
>>>>>>> place
>>>>>>> the
>>>>>>> box eventually recovers). I've attached it as a text file
>>>>>>> since it is
>>>>>>> long.
>>>>>>
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>>> Second is I was thinking about why we would spend so much time
>>>>>>> doing
>>>>>>> the
>>>>>>> ->owners list, and obviously it's because of the massive amount of
>>>>>>> timewait sockets on the owners list. I wrote the following
>>>>>>> dumb patch
>>>>>>> and tested it and the problem has disappeared completely. Now
>>>>>>> I don't
>>>>>>> know if this is right at all, but I thought it was weird we
>>>>>>> weren't
>>>>>>> copying the soreuseport option from the original socket onto
>>>>>>> the twsk.
>>>>>>> Is there are reason we aren't doing this currently? Does this
>>>>>>> help
>>>>>>> explain what is happening? Thanks,
>>>>>>
>>>>>>
>>>>>> The patch is interesting and a good clue, but I am immediately a bit
>>>>>> concerned that we don't copy/tag the socket with the uid also to
>>>>>> keep
>>>>>> the security properties for SO_REUSEPORT. I have to think a bit more
>>>>>> about this.
>>>>>>
>>>>>> We have seen hangs during connect. I am afraid this patch
>>>>>> wouldn't help
>>>>>> there while also guaranteeing uniqueness.
>>>>>
>>>>>
>>>>>
>>>>> Yeah so I looked at the code some more and actually my patch is
>>>>> really
>>>>> bad. If sk2->sk_reuseport is set we'll look at
>>>>> sk2->sk_reuseport_cb, which
>>>>> is outside of the timewait sock, so that's definitely bad.
>>>>>
>>>>> But we should at least be setting it to 0 so that we don't do this
>>>>> normally. Unfortunately simply setting it to 0 doesn't fix the
>>>>> problem. So
>>>>> for some reason having ->sk_reuseport set to 1 on a timewait
>>>>> socket makes
>>>>> this problem non-existent, which is strange.
>>>>>
>>>>> So back to the drawing board I guess. I wonder if doing what craig
>>>>> suggested and batching the timewait timer expires so it hurts less
>>>>> would
>>>>> accomplish the same results. Thanks,
>>>>
>>>>
>>>> Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This
>>>> is the
>>>> code
>>>>
>>>> if ((!reuse || !sk2->sk_reuse ||
>>>> sk2->sk_state == TCP_LISTEN) &&
>>>> (!reuseport || !sk2->sk_reuseport ||
>>>> rcu_access_pointer(sk->sk_reuseport_cb) ||
>>>> (sk2->sk_state != TCP_TIME_WAIT &&
>>>> !uid_eq(uid, sock_i_uid(sk2))))) {
>>>>
>>>> if (!sk2->sk_rcv_saddr ||
>>>> !sk->sk_rcv_saddr
>>>> ||
>>>> sk2->sk_rcv_saddr ==
>>>> sk->sk_rcv_saddr)
>>>> break;
>>>> }
>>>>
>>>> so in my patches case we now have reuseport == 1, sk2->sk_reuseport
>>>> == 1.
>>>> But now we are using reuseport, so sk->sk_reuseport_cb should be
>>>> non-NULL
>>>> right? So really setting the timewait sock's sk_reuseport should
>>>> have no
>>>> bearing on how this loop plays out right? Thanks,
>>>
>>>
>>>
>>> So more messing around and I noticed that we basically don't do the
>>> tb->fastreuseport logic at all if we've ended up with a non
>>> SO_REUSEPORT
>>> socket on that tb. So before I fully understood what I was doing I
>>> fixed it
>>> so that after we go through ->bind_conflict() once with a SO_REUSEPORT
>>> socket, we reset tb->fastreuseport to 1 and set the uid to match the
>>> uid of
>>> the socket. This made the problem go away. Tom pointed out that if
>>> we bind
>>> to the same port on a different address and we have a non SO_REUSEPORT
>>> socket with the same address on this tb then we'd be screwed with my
>>> code.
>>>
>>> Which brings me to his proposed solution. We need another hash
>>> table that
>>> is indexed based on the binding address. Then each node corresponds
>>> to one
>>> address/port binding, with non-SO_REUSEPORT entries having only one
>>> entry,
>>> and normal SO_REUSEPORT entries having many. This cleans up the
>>> need to
>>> search all the possible sockets on any given tb, we just go and look
>>> at the
>>> one we care about. Does this make sense? Thanks,
>>>
>> Hi Josef,
>>
>> Thinking about it some more the hash table won't work because of the
>> rules of binding different addresses to the same port. What I think we
>> can do is to change inet_bind_bucket to be structure that contains all
>> the information used to detect conflicts (reuse*, if, address, uid,
>> etc.) and a list of sockets that share that exact same information--
>> for instance all socket in timewait state create through some listener
>> socket should wind up on single bucket. When we do the bind_conflict
>> function we only should have to walk this buckets, not the full list
>> of sockets.
>>
>> Thoughts on this?
>
> This sounds good, maybe tb->owners be a list of say
>
> struct inet_unique_shit {
> struct sock_common sk;
> struct hlist socks;
> };
>
> Then we make inet_unique_shit like twsks', just copy the relevant
> information, then hang the real sockets off of the socks hlist.
> Something like that? Thanks,
As a counter idea: can we push up a flag to the inet_bind_bucket that we
check in the fast- way style that indicates that we have wildcarded
non-reuse(port) in there, so we can skip the bind_bucket much more
quickly? This wouldn't require a new data structure.
Thanks,
Hannes
Powered by blists - more mailing lists