linux-kernel - Re: [RFC] EADDRINUSE from bind() on application restart after killing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iKD=ceuLnhK-zpk3QerpS-FUb_wb_HevkpvsVqGJ_T4NQ@mail.gmail.com>
Date:   Fri, 14 Oct 2022 09:34:47 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     Paul Gofman <pgofman@...eweavers.com>
Cc:     Muhammad Usama Anjum <usama.anjum@...labora.com>,
        "open list:NETWORKING [TCP]" <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        David Ahern <dsahern@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Jakub Kicinski <kuba@...nel.org>
Subject: Re: [RFC] EADDRINUSE from bind() on application restart after killing

On Fri, Oct 14, 2022 at 9:31 AM Paul Gofman <pgofman@...eweavers.com> wrote:
>
> Hello Eric,
>
>      that message was not mine.
>
>      Speaking from the Wine side, we cannot workaround that with
> SO_REUSEADDR. First of all, it is under app control and we can't
> voluntary tweak app's socket settings. Then, app might be intentionally
> not using SO_REUSEADDR to prevent port reuse which of course may be
> harmful (more harmful than failure to restart for another minute). What
> is broken with the application which doesn't want to use SO_REUSEADDR
> and wants to disallow port reuse while it binds to it which reuse will
> surely break it?
>
>      But my present question about the listening socket being not
> reusable while closed due to linked accepeted socket was not related to
> Wine at all. I am not sure how one can fix that in the application if
> they don't really want other applications or another copy of the same
> one to be able to reuse the port they currently bind to? I believe the
> issue with listen socket been not available happens rather often for
> native services and they all have to workaround that. While not related
> here, I also encountered some out-of-tree hacks to tweak the TIME_WAIT
> timeout to tackle this very problem for some cloud custom kernels.
>
>      My question is if the behaviour of blocking listen socket port
> while the accepted port (which, as I understand, does not have any
> direct relation to listen port anymore from TCP standpoint) is still in
> TIME_ or other wait is stipulated by TCP requirements which I am
> missing? Or, if not, maybe that can be changed?
>

Please raise these questions at IETF, this is where major TCP changes
need to be approved.

There are multiple ways to avoid TIME_WAIT, if you really need to.


> Thanks,
>      Paul.
>
>
> On 10/14/22 11:20, Eric Dumazet wrote:
> > On Fri, Oct 14, 2022 at 8:52 AM Paul Gofman <pgofman@...eweavers.com> wrote:
> >> Hello Eric,
> >>
> >> our problem is actually not with the accept socket / port for which
> >> those timeouts apply, we don't care for that temporary port number. The
> >> problem is that the listen port (to which apps bind explicitly) is also
> >> busy until the accept socket waits through all the necessary timeouts
> >> and is fully closed. From my reading of TCP specs I don't understand why
> >> it should be this way. The TCP hazards stipulating those timeouts seem
> >> to apply to accept (connection) socket / port only. Shouldn't listen
> >> socket's port (the only one we care about) be available for bind
> >> immediately after the app stops listening on it (either due to closing
> >> the listen socket or process force kill), or maybe have some other
> >> timeouts not related to connected accept socket / port hazards? Or am I
> >> missing something why it should be the way it is done now?
> >>
> >
> > To quote your initial message :
> >
> > <quote>
> > We are able to avoid this error by adding SO_REUSEADDR attribute to the
> > socket in a hack. But this hack cannot be added to the application
> > process as we don't own it.
> > </quote>
> >
> > Essentially you are complaining of the linux kernel being unable to
> > run a buggy application.
> >
> > We are not going to change the linux kernel because you can not
> > fix/recompile an application.
> >
> > Note that you could use LD_PRELOAD, or maybe eBPF to automatically
> > turn SO_REUSEADDR before bind()
> >
> >
> >> Thanks,
> >>       Paul.
> >>
> >>
> >> On 9/30/22 10:16, Eric Dumazet wrote:
> >>> On Fri, Sep 30, 2022 at 6:24 AM Muhammad Usama Anjum
> >>> <usama.anjum@...labora.com> wrote:
> >>>> Hi Eric,
> >>>>
> >>>> RFC 1337 describes the TIME-WAIT Assassination Hazards in TCP. Because
> >>>> of this hazard we have 60 seconds timeout in TIME_WAIT state if
> >>>> connection isn't closed properly. From RFC 1337:
> >>>>> The TIME-WAIT delay allows all old duplicate segments time
> >>>> enough to die in the Internet before the connection is reopened.
> >>>>
> >>>> As on localhost there is virtually no delay. I think the TIME-WAIT delay
> >>>> must be zero for localhost connections. I'm no expert here. On localhost
> >>>> there is no delay. So why should we wait for 60 seconds to mitigate a
> >>>> hazard which isn't there?
> >>> Because we do not specialize TCP stack for loopback.
> >>>
> >>> It is easy to force delays even for loopback (tc qdisc add dev lo root
> >>> netem ...)
> >>>
> >>> You can avoid TCP complexity (cpu costs) over loopback using AF_UNIX instead.
> >>>
> >>> TIME_WAIT sockets are optional.
> >>> If you do not like them, simply set /proc/sys/net/ipv4/tcp_max_tw_buckets to 0 ?
> >>>
> >>>> Zapping the sockets in TIME_WAIT and FIN_WAIT_2 does removes them. But
> >>>> zap is required from privileged (CAP_NET_ADMIN) process. We are having
> >>>> hard time finding a privileged process to do this.
> >>> Really, we are not going to add kludges in TCP stacks because of this reason.
> >>>
> >>>> Thanks,
> >>>> Usama
> >>>>
> >>>>
> >>>> On 5/24/22 1:18 PM, Muhammad Usama Anjum wrote:
> >>>>> Hello,
> >>>>>
> >>>>> We have a set of processes which talk with each other through a local
> >>>>> TCP socket. If the process(es) are killed (through SIGKILL) and
> >>>>> restarted at once, the bind() fails with EADDRINUSE error. This error
> >>>>> only appears if application is restarted at once without waiting for 60
> >>>>> seconds or more. It seems that there is some timeout of 60 seconds for
> >>>>> which the previous TCP connection remains alive waiting to get closed
> >>>>> completely. In that duration if we try to connect again, we get the error.
> >>>>>
> >>>>> We are able to avoid this error by adding SO_REUSEADDR attribute to the
> >>>>> socket in a hack. But this hack cannot be added to the application
> >>>>> process as we don't own it.
> >>>>>
> >>>>> I've looked at the TCP connection states after killing processes in
> >>>>> different ways. The TCP connection ends up in 2 different states with
> >>>>> timeouts:
> >>>>>
> >>>>> (1) Timeout associated with FIN_WAIT_1 state which is set through
> >>>>> `tcp_fin_timeout` in procfs (60 seconds by default)
> >>>>>
> >>>>> (2) Timeout associated with TIME_WAIT state which cannot be changed. It
> >>>>> seems like this timeout has come from RFC 1337.
> >>>>>
> >>>>> The timeout in (1) can be changed. Timeout in (2) cannot be changed. It
> >>>>> also doesn't seem feasible to change the timeout of TIME_WAIT state as
> >>>>> the RFC mentions several hazards. But we are talking about a local TCP
> >>>>> connection where maybe those hazards aren't applicable directly? Is it
> >>>>> possible to change timeout for TIME_WAIT state for only local
> >>>>> connections without any hazards?
> >>>>>
> >>>>> We have tested a hack where we replace timeout of TIME_WAIT state from a
> >>>>> value in procfs for local connections. This solves our problem and
> >>>>> application starts to work without any modifications to it.
> >>>>>
> >>>>> The question is that what can be the best possible solution here? Any
> >>>>> thoughts will be very helpful.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>> --
> >>>> Muhammad Usama Anjum
> >>
>