netdev - Re: Soft lockup in inet_put

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e32b5375-38fb-1fd8-d97a-bf740e368205@stressinduktion.org>
Date:   Sat, 17 Dec 2016 12:08:52 +0100
From:   Hannes Frederic Sowa <hannes@...essinduktion.org>
To:     Josef Bacik <jbacik@...com>, Tom Herbert <tom@...bertland.com>
Cc:     Craig Gallek <kraigatgoog@...il.com>,
        Eric Dumazet <eric.dumazet@...il.com>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6

On 16.12.2016 23:50, Josef Bacik wrote:
> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <tom@...bertland.com> wrote:
>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jbacik@...com> wrote:
>>>  On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jbacik@...com> wrote:
>>>>
>>>>  On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@...com> wrote:
>>>>>
>>>>>  On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>>>  <hannes@...essinduktion.org> wrote:
>>>>>>
>>>>>>  Hi Josef,
>>>>>>
>>>>>>  On 15.12.2016 19:53, Josef Bacik wrote:
>>>>>>>
>>>>>>>   On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@...bertland.com>
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>>   On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
>>>>>>>> <kraigatgoog@...il.com>
>>>>>>>>   wrote:
>>>>>>>>>
>>>>>>>>>    On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
>>>>>>>>> <tom@...bertland.com>
>>>>>>>>>   wrote:
>>>>>>>>>>
>>>>>>>>>>    I think there may be some suspicious code in
>>>>>>>>>> inet_csk_get_port. At
>>>>>>>>>>    tb_found there is:
>>>>>>>>>>
>>>>>>>>>>                    if (((tb->fastreuse > 0 && reuse) ||
>>>>>>>>>>                         (tb->fastreuseport > 0 &&
>>>>>>>>>>                         
>>>>>>>>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>>                          sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>>>>>>   uid))) &&
>>>>>>>>>>                        smallest_size == -1)
>>>>>>>>>>                            goto success;
>>>>>>>>>>                    if
>>>>>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>>>>>   tb, true)) {
>>>>>>>>>>                            if ((reuse ||
>>>>>>>>>>                                 (tb->fastreuseport > 0 &&
>>>>>>>>>>                                  sk->sk_reuseport &&
>>>>>>>>>>
>>>>>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>>                                  uid_eq(tb->fastuid, uid))) &&
>>>>>>>>>>                                smallest_size != -1 &&
>>>>>>>>>> --attempts >=
>>>>>>>>>>  0) {
>>>>>>>>>>                                    spin_unlock_bh(&head->lock);
>>>>>>>>>>                                    goto again;
>>>>>>>>>>                            }
>>>>>>>>>>                            goto fail_unlock;
>>>>>>>>>>                    }
>>>>>>>>>>
>>>>>>>>>>    AFAICT there is redundancy in these two conditionals.  The
>>>>>>>>>> same
>>>>>>>>>>  clause
>>>>>>>>>>    is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>>>>>    !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>>>>> sk->sk_reuseport &&
>>>>>>>>>>    uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>>>>>>>>>> is true
>>>>>>>>>>  the
>>>>>>>>>>    first conditional should be hit, goto done,  and the second
>>>>>>>>>> will
>>>>>>>>>>  never
>>>>>>>>>>    evaluate that part to true-- unless the sk is changed (do
>>>>>>>>>> we need
>>>>>>>>>>    READ_ONCE for sk->sk_reuseport_cb?).
>>>>>>>>>
>>>>>>>>>    That's an interesting point... It looks like this function also
>>>>>>>>>    changed in 4.6 from using a single local_bh_disable() at the
>>>>>>>>>  beginning
>>>>>>>>>    with several spin_lock(&head->lock) to exclusively
>>>>>>>>>    spin_lock_bh(&head->lock) at each locking point.  Perhaps
>>>>>>>>> the full
>>>>>>>>>  bh
>>>>>>>>>    disable variant was preventing the timers in your stack
>>>>>>>>> trace from
>>>>>>>>>    running interleaved with this function before?
>>>>>>>>
>>>>>>>>
>>>>>>>>   Could be, although dropping the lock shouldn't be able to
>>>>>>>> affect the
>>>>>>>>   search state. TBH, I'm a little lost in reading function, the
>>>>>>>>   SO_REUSEPORT handling is pretty complicated. For instance,
>>>>>>>>   rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
>>>>>>>>  that
>>>>>>>>   function and also in every call to inet_csk_bind_conflict. I
>>>>>>>> wonder
>>>>>>>>  if
>>>>>>>>   we can simply this under the assumption that SO_REUSEPORT is only
>>>>>>>>   allowed if the port number (snum) is explicitly specified.
>>>>>>>
>>>>>>>
>>>>>>>   Ok first I have data for you Hannes, here's the time distributions
>>>>>>>   before during and after the lockup (with all the debugging in
>>>>>>> place
>>>>>>>  the
>>>>>>>   box eventually recovers).  I've attached it as a text file
>>>>>>> since it is
>>>>>>>   long.
>>>>>>
>>>>>>
>>>>>>  Thanks a lot!
>>>>>>
>>>>>>>   Second is I was thinking about why we would spend so much time
>>>>>>> doing
>>>>>>>  the
>>>>>>>   ->owners list, and obviously it's because of the massive amount of
>>>>>>>   timewait sockets on the owners list.  I wrote the following
>>>>>>> dumb patch
>>>>>>>   and tested it and the problem has disappeared completely.  Now
>>>>>>> I don't
>>>>>>>   know if this is right at all, but I thought it was weird we
>>>>>>> weren't
>>>>>>>   copying the soreuseport option from the original socket onto
>>>>>>> the twsk.
>>>>>>>   Is there are reason we aren't doing this currently?  Does this
>>>>>>> help
>>>>>>>   explain what is happening?  Thanks,
>>>>>>
>>>>>>
>>>>>>  The patch is interesting and a good clue, but I am immediately a bit
>>>>>>  concerned that we don't copy/tag the socket with the uid also to
>>>>>> keep
>>>>>>  the security properties for SO_REUSEPORT. I have to think a bit more
>>>>>>  about this.
>>>>>>
>>>>>>  We have seen hangs during connect. I am afraid this patch
>>>>>> wouldn't help
>>>>>>  there while also guaranteeing uniqueness.
>>>>>
>>>>>
>>>>>
>>>>>  Yeah so I looked at the code some more and actually my patch is
>>>>> really
>>>>>  bad.  If sk2->sk_reuseport is set we'll look at
>>>>> sk2->sk_reuseport_cb, which
>>>>>  is outside of the timewait sock, so that's definitely bad.
>>>>>
>>>>>  But we should at least be setting it to 0 so that we don't do this
>>>>>  normally.  Unfortunately simply setting it to 0 doesn't fix the
>>>>> problem.  So
>>>>>  for some reason having ->sk_reuseport set to 1 on a timewait
>>>>> socket makes
>>>>>  this problem non-existent, which is strange.
>>>>>
>>>>>  So back to the drawing board I guess.  I wonder if doing what craig
>>>>>  suggested and batching the timewait timer expires so it hurts less
>>>>> would
>>>>>  accomplish the same results.  Thanks,
>>>>
>>>>
>>>>  Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This
>>>> is the
>>>>  code
>>>>
>>>>                         if ((!reuse || !sk2->sk_reuse ||
>>>>                             sk2->sk_state == TCP_LISTEN) &&
>>>>                             (!reuseport || !sk2->sk_reuseport ||
>>>>                              rcu_access_pointer(sk->sk_reuseport_cb) ||
>>>>                              (sk2->sk_state != TCP_TIME_WAIT &&
>>>>                              !uid_eq(uid, sock_i_uid(sk2))))) {
>>>>
>>>>                                 if (!sk2->sk_rcv_saddr ||
>>>> !sk->sk_rcv_saddr
>>>>  ||
>>>>                                     sk2->sk_rcv_saddr ==
>>>> sk->sk_rcv_saddr)
>>>>                                         break;
>>>>                         }
>>>>
>>>>  so in my patches case we now have reuseport == 1, sk2->sk_reuseport
>>>> == 1.
>>>>  But now we are using reuseport, so sk->sk_reuseport_cb should be
>>>> non-NULL
>>>>  right?  So really setting the timewait sock's sk_reuseport should
>>>> have no
>>>>  bearing on how this loop plays out right?  Thanks,
>>>
>>>
>>>
>>>  So more messing around and I noticed that we basically don't do the
>>>  tb->fastreuseport logic at all if we've ended up with a non
>>> SO_REUSEPORT
>>>  socket on that tb.  So before I fully understood what I was doing I
>>> fixed it
>>>  so that after we go through ->bind_conflict() once with a SO_REUSEPORT
>>>  socket, we reset tb->fastreuseport to 1 and set the uid to match the
>>> uid of
>>>  the socket.  This made the problem go away.  Tom pointed out that if
>>> we bind
>>>  to the same port on a different address and we have a non SO_REUSEPORT
>>>  socket with the same address on this tb then we'd be screwed with my
>>> code.
>>>
>>>  Which brings me to his proposed solution.  We need another hash
>>> table that
>>>  is indexed based on the binding address.  Then each node corresponds
>>> to one
>>>  address/port binding, with non-SO_REUSEPORT entries having only one
>>> entry,
>>>  and normal SO_REUSEPORT entries having many.  This cleans up the
>>> need to
>>>  search all the possible sockets on any given tb, we just go and look
>>> at the
>>>  one we care about.  Does this make sense?  Thanks,
>>>
>> Hi Josef,
>>
>> Thinking about it some more the hash table won't work because of the
>> rules of binding different addresses to the same port. What I think we
>> can do is to change inet_bind_bucket to be structure that contains all
>> the information used to detect conflicts (reuse*, if, address, uid,
>> etc.) and a list of sockets that share that exact same information--
>> for instance all socket in timewait state create through some listener
>> socket should wind up on single bucket. When we do the bind_conflict
>> function we only should have to walk this buckets, not the full list
>> of sockets.
>>
>> Thoughts on this?
> 
> This sounds good, maybe tb->owners be a list of say
> 
> struct inet_unique_shit {
>     struct sock_common sk;
>     struct hlist socks;
> };
> 
> Then we make inet_unique_shit like twsks', just copy the relevant
> information, then hang the real sockets off of the socks hlist. 
> Something like that?  Thanks,

As a counter idea: can we push up a flag to the inet_bind_bucket that we
check in the fast- way style that indicates that we have wildcarded
non-reuse(port) in there, so we can skip the bind_bucket much more
quickly? This wouldn't require a new data structure.

Thanks,
Hannes