netdev - Re: Soft lockup in inet_put

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1481901684.24490.7@smtp.office365.com>
Date:   Fri, 16 Dec 2016 10:21:24 -0500
From:   Josef Bacik <jbacik@...com>
To:     Hannes Frederic Sowa <hannes@...essinduktion.org>
CC:     Tom Herbert <tom@...bertland.com>,
        Craig Gallek <kraigatgoog@...il.com>,
        Eric Dumazet <eric.dumazet@...il.com>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6

On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jbacik@...com> wrote:
> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 
> <hannes@...essinduktion.org> wrote:
>> Hi Josef,
>> 
>> On 15.12.2016 19:53, Josef Bacik wrote:
>>>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <tom@...bertland.com> 
>>> wrote:
>>>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>>> <kraigatgoog@...il.com>
>>>>  wrote:
>>>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>>>> <tom@...bertland.com>
>>>>>  wrote:
>>>>>>   I think there may be some suspicious code in 
>>>>>> inet_csk_get_port. At
>>>>>>   tb_found there is:
>>>>>> 
>>>>>>                   if (((tb->fastreuse > 0 && reuse) ||
>>>>>>                        (tb->fastreuseport > 0 &&
>>>>>>                         !rcu_access_pointer(sk->sk_reuseport_cb) 
>>>>>> &&
>>>>>>                         sk->sk_reuseport && uid_eq(tb->fastuid,
>>>>>>  uid))) &&
>>>>>>                       smallest_size == -1)
>>>>>>                           goto success;
>>>>>>                   if 
>>>>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>>>>  tb, true)) {
>>>>>>                           if ((reuse ||
>>>>>>                                (tb->fastreuseport > 0 &&
>>>>>>                                 sk->sk_reuseport &&
>>>>>> 
>>>>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>>>>                                 uid_eq(tb->fastuid, uid))) &&
>>>>>>                               smallest_size != -1 && --attempts 
>>>>>> >= 0) {
>>>>>>                                   spin_unlock_bh(&head->lock);
>>>>>>                                   goto again;
>>>>>>                           }
>>>>>>                           goto fail_unlock;
>>>>>>                   }
>>>>>> 
>>>>>>   AFAICT there is redundancy in these two conditionals.  The 
>>>>>> same clause
>>>>>>   is being checked in both: (tb->fastreuseport > 0 &&
>>>>>>   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>>>>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 
>>>>>> true the
>>>>>>   first conditional should be hit, goto done,  and the second 
>>>>>> will never
>>>>>>   evaluate that part to true-- unless the sk is changed (do we 
>>>>>> need
>>>>>>   READ_ONCE for sk->sk_reuseport_cb?).
>>>>>   That's an interesting point... It looks like this function also
>>>>>   changed in 4.6 from using a single local_bh_disable() at the 
>>>>> beginning
>>>>>   with several spin_lock(&head->lock) to exclusively
>>>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the 
>>>>> full bh
>>>>>   disable variant was preventing the timers in your stack trace 
>>>>> from
>>>>>   running interleaved with this function before?
>>>> 
>>>>  Could be, although dropping the lock shouldn't be able to affect 
>>>> the
>>>>  search state. TBH, I'm a little lost in reading function, the
>>>>  SO_REUSEPORT handling is pretty complicated. For instance,
>>>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in 
>>>> that
>>>>  function and also in every call to inet_csk_bind_conflict. I 
>>>> wonder if
>>>>  we can simply this under the assumption that SO_REUSEPORT is only
>>>>  allowed if the port number (snum) is explicitly specified.
>>> 
>>>  Ok first I have data for you Hannes, here's the time distributions
>>>  before during and after the lockup (with all the debugging in 
>>> place the
>>>  box eventually recovers).  I've attached it as a text file since 
>>> it is
>>>  long.
>> 
>> Thanks a lot!
>> 
>>>  Second is I was thinking about why we would spend so much time 
>>> doing the
>>>  ->owners list, and obviously it's because of the massive amount of
>>>  timewait sockets on the owners list.  I wrote the following dumb 
>>> patch
>>>  and tested it and the problem has disappeared completely.  Now I 
>>> don't
>>>  know if this is right at all, but I thought it was weird we weren't
>>>  copying the soreuseport option from the original socket onto the 
>>> twsk.
>>>  Is there are reason we aren't doing this currently?  Does this help
>>>  explain what is happening?  Thanks,
>> 
>> The patch is interesting and a good clue, but I am immediately a bit
>> concerned that we don't copy/tag the socket with the uid also to keep
>> the security properties for SO_REUSEPORT. I have to think a bit more
>> about this.
>> 
>> We have seen hangs during connect. I am afraid this patch wouldn't 
>> help
>> there while also guaranteeing uniqueness.
> 
> 
> Yeah so I looked at the code some more and actually my patch is 
> really bad.  If sk2->sk_reuseport is set we'll look at 
> sk2->sk_reuseport_cb, which is outside of the timewait sock, so 
> that's definitely bad.
> 
> But we should at least be setting it to 0 so that we don't do this 
> normally.  Unfortunately simply setting it to 0 doesn't fix the 
> problem.  So for some reason having ->sk_reuseport set to 1 on a 
> timewait socket makes this problem non-existent, which is strange.
> 
> So back to the drawing board I guess.  I wonder if doing what craig 
> suggested and batching the timewait timer expires so it hurts less 
> would accomplish the same results.  Thanks,

Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This is 
the code

                        if ((!reuse || !sk2->sk_reuse ||
                            sk2->sk_state == TCP_LISTEN) &&
                            (!reuseport || !sk2->sk_reuseport ||
                             rcu_access_pointer(sk->sk_reuseport_cb) ||
                             (sk2->sk_state != TCP_TIME_WAIT &&
                             !uid_eq(uid, sock_i_uid(sk2))))) {

                                if (!sk2->sk_rcv_saddr || 
!sk->sk_rcv_saddr ||
                                    sk2->sk_rcv_saddr == 
sk->sk_rcv_saddr)
                                        break;
                        }

so in my patches case we now have reuseport == 1, sk2->sk_reuseport == 
1.  But now we are using reuseport, so sk->sk_reuseport_cb should be 
non-NULL right?  So really setting the timewait sock's sk_reuseport 
should have no bearing on how this loop plays out right?  Thanks,

Josef