linux-kernel - Re: [RFC PATCH bpf-next 0/8] Socket migration for SO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20201120023157.immjndtw4hgcyz75@kafai-mbp.dhcp.thefacebook.com>
Date:   Thu, 19 Nov 2020 18:31:57 -0800
From:   Martin KaFai Lau <kafai@...com>
To:     Kuniyuki Iwashima <kuniyu@...zon.co.jp>
CC:     <ast@...nel.org>, <benh@...zon.com>, <bpf@...r.kernel.org>,
        <daniel@...earbox.net>, <davem@...emloft.net>,
        <edumazet@...gle.com>, <kuba@...nel.org>, <kuni1840@...il.com>,
        <linux-kernel@...r.kernel.org>, <netdev@...r.kernel.org>
Subject: Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.

On Fri, Nov 20, 2020 at 07:17:49AM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@...com>
> Date:   Wed, 18 Nov 2020 17:49:13 -0800
> > On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > accept connections evenly. However, there is a defect in the current
> > > implementation. When a SYN packet is received, the connection is tied to a
> > > listening socket. Accordingly, when the listener is closed, in-flight
> > > requests during the three-way handshake and child sockets in the accept
> > > queue are dropped even if other listeners could accept such connections.
> > > 
> > > This situation can happen when various server management tools restart
> > > server (such as nginx) processes. For instance, when we change nginx
> > > configurations and restart it, it spins up new workers that respect the new
> > > configuration and closes all listeners on the old workers, resulting in
> > > in-flight ACK of 3WHS is responded by RST.
> > > 
> > > As a workaround for this issue, we can do connection draining by eBPF:
> > > 
> > >   1. Before closing a listener, stop routing SYN packets to it.
> > >   2. Wait enough time for requests to complete 3WHS.
> > >   3. Accept connections until EAGAIN, then close the listener.
> > > 
> > > Although this approach seems to work well, EAGAIN has nothing to do with
> > > how many requests are still during 3WHS. Thus, we have to know the number
> > It sounds like the application can already drain the established socket
> > by accept()?  To solve the problem that you have,
> > does it mean migrating req_sk (the in-progress 3WHS) is enough?
> 
> Ideally, the application needs to drain only the accepted sockets because
> 3WHS and tying a connection to a listener are just kernel behaviour. Also,
> there are some cases where we want to apply new configurations as soon as
> possible such as replacing TLS certificates.
> 
> It is possible to drain the established sockets by accept(), but the
> sockets in the accept queue have not started application sessions yet. So,
> if we do not drain such sockets (or if the kernel happened to select
> another listener), we can (could) apply the new settings much earlier.
> 
> Moreover, the established sockets may start long-standing connections so
> that we cannot complete draining for a long time and may have to
> force-close them (and they would have longer lifetime if they are migrated
> to a new listener).
> 
> 
> > Applications can already use the bpf prog to do (1) and divert
> > the SYN to the newly started process.
> > 
> > If the application cares about service disruption,
> > it usually needs to drain the fd(s) that it already has and
> > finishes serving the pending request (e.g. https) on them anyway.
> > The time taking to finish those could already be longer than it takes
> > to drain the accept queue or finish off the 3WHS in reasonable time.
> > or the application that you have does not need to drain the fd(s) 
> > it already has and it can close them immediately?
> 
> In the point of view of service disruption, I agree with you.
> 
> However, I think that there are some situations where we want to apply new
> configurations rather than to drain sockets with old configurations and
> that if the kernel migrates sockets automatically, we can simplify user
> programs.
This configuration-update(/new-TLS-cert...etc) consideration will be useful
if it is also included in the cover letter.

It sounds like the service that you have is draining the existing
already-accepted fd(s) which are using the old configuration.
Those existing fd(s) could also be long life.  Potentially those
existing fd(s) will be in a much higher number than the
to-be-accepted fd(s)?

or you meant in some cases it wants to migrate to the new configuration
ASAP (e.g. for security reason) even it has to close all the
already-accepted fds() which are using the old configuration??

In either cases, considering the already-accepted fd(s)
is usually in a much more number, does the to-be-accepted
connection make any difference percentage-wise?