Message-ID: <067589C4-361F-49FE-B493-83BC0EC38277@fb.com>
Date: Tue, 20 Dec 2016 04:59:22 +0000
From: Josef Bacik <jbacik@...com>
To: Eric Dumazet <eric.dumazet@...il.com>
CC: Tom Herbert <tom@...bertland.com>,
David Miller <davem@...emloft.net>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Craig Gallek <kraigatgoog@...il.com>,
Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: Soft lockup in inet_put_port on 4.6
> On Dec 19, 2016, at 11:52 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>
> On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote:
>>> On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>>>
>>>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
>>>>
>>>> When sockets created with SO_REUSEPORT move to TW state, they are placed
>>>> back on tb->owners. fastreuseport is no longer set, so we have
>>>> to walk a potentially long list of sockets in tb->owners to open a new
>>>> listener socket. I imagine this happens when we try to open a new
>>>> SO_REUSEPORT listener after the system has been running a while, and so
>>>> we hit the long tb->owners list.
>>>
>>> Hmm... __inet_twsk_hashdance() does not change tb->fastreuse
>>>
>>> So where is tb->fastreuse cleared?
>>>
>>> If all your sockets have SO_REUSEPORT set, this should not happen.
>>>
>>
>> The app starts out with no SO_REUSEPORT, and then we restart it with
>> that option enabled.
>
> But... why would the application do this dance?
>
> I now better understand why we never had these issues...
>
It doesn't do it as part of its normal operation. The old version didn't use SO_REUSEPORT; then somebody added support for it, restarted the service with the new option enabled, and boom. They immediately stopped doing anything and gave it to me to figure out.
>
>> What I suspect is that we have all the twsks from the original service,
>> and the fastreuse stuff is cleared. My naive patch resets it once we
>> add a reuseport sk to the tb and that makes the problem go away. I'm
>> reworking all of this logic and adding some extra info to the tb to
>> make the reset actually safe. I'll send those patches out tomorrow.
>> Thanks,
>
> Okay, we will review them ;)
>
> Note that Willy Tarreau wants some mechanism to be able to freeze a
> listener, to allow haproxy to be replaced without closing any sessions.
>
I assume that's what these guys would want as well. They have some weird handoff thing they do when the app starts, but I'm not entirely convinced it does what they think it does. Thanks,
Josef