Date:   Fri, 22 Apr 2022 15:55:12 -0700
From:   Joanne Koong <joannelkoong@...il.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     netdev <netdev@...r.kernel.org>, Martin KaFai Lau <kafai@...com>,
        David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>
Subject: Re: [net-next v1] net: Add a second bind table hashed by port and address

On Fri, Apr 22, 2022 at 2:25 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Fri, Apr 22, 2022 at 2:07 PM Joanne Koong <joannelkoong@...il.com> wrote:
> >
> > On Thu, Apr 21, 2022 at 3:50 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Thu, Apr 21, 2022 at 3:16 PM Joanne Koong <joannelkoong@...il.com> wrote:
> > > >
> > > > We currently have one tcp bind table (bhash) which hashes by port
> > > > number only. In the socket bind path, we check for bind conflicts by
> > > > traversing the specified port's inet_bind_bucket while holding the
> > > > bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
> > > >
> > > > In instances where there are tons of sockets hashed to the same port
> > > > at different addresses, checking for a bind conflict is time-intensive
> > > > and can cause softirq cpu lockups, as well as stall new tcp connections,
> > > > since __inet_inherit_port() also contends for the spinlock.
> > > >
> > > > This patch proposes adding a second bind table, bhash2, that hashes by
> > > > port and ip address. Searching the bhash2 table leads to significantly
> > > > faster conflict resolution and less time holding the spinlock.
> > > > In experimental testing on a local server, the time it takes to complete
> > > > a bind request was as follows:
> > > >
> > > > when there are ~24k sockets already bound to the port -
> > > >
> > > > ipv4:
> > > > before - 0.002317 seconds
> > > > with bhash2 - 0.000018 seconds
> > > >
> > > > ipv6:
> > > > before - 0.002431 seconds
> > > > with bhash2 - 0.000021 seconds
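(As an aside, here is a purely illustrative userspace sketch of the difference
in what the conflict scan has to walk -- these are placeholder names, not the
structures or hash functions from the patch: bhash keys its buckets by port
alone, while bhash2 keys them by port plus address.)

/* Illustrative sketch only -- not the kernel's actual structures or hash
 * functions. A port-only key means the conflict check walks every socket
 * bound to that port; a (port, address) key means it only walks sockets
 * sharing both.
 */
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>

#define TABLE_SIZE 65536                        /* placeholder table size */

static unsigned int bhash_bucket(uint16_t port)
{
	return port & (TABLE_SIZE - 1);         /* key: port only */
}

static unsigned int bhash2_bucket(uint16_t port, const struct in6_addr *addr)
{
	uint32_t low;                           /* low 32 bits of the address */

	memcpy(&low, &addr->s6_addr[12], sizeof(low));
	return (port ^ low) & (TABLE_SIZE - 1); /* key: port + address */
}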
> > >
> > >
> > > Hi Joanne
> > >
> > > Do you have a test for this ? Are you using 24k IPv6 addresses on the host ?
> > >
> > > I fear we add some extra code and cost for quite an unusual configuration.
> > >
> > > Thanks.
> > >
> > Hi Eric,
> >
> > I have a test on my local server that populates the bhash table entry
> > with 24k sockets for a given port and address, and then times how long
> > a bind request on that port takes.
>
> OK, but why 24k ? Why not 24 M then ?
>
> In this case, will a 64K hash table be big enough ?
24k was one test scenario; another was ~12M. These were used to get a
sense of how the bhash2 table performs when the bhash table entry for
the port is saturated.
>
> > When populating the table entry, I
> > use the same IPv6 address on the host (with SO_REUSEADDR set). At
> > Facebook, there are some internal teams that submit bind requests for
> > 400 vips on the same port on concurrent threads that run into softirq
> > lockup issues due to the bhash table entry spinlock contention, which
> > is the main motivation behind this patch.
>
> I am pretty sure the IPv6 stack does not scale well if we have
> thousands of IPv6 addresses on one netdev.
> Some O(N) behavior will also trigger latency violations.
>
> Can you share the test, in a form that can be added in linux tree ?
I will include it somewhere under tools/testing/selftests/net - does that sound okay?
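Roughly, the test binds a large number of sockets with SO_REUSEADDR to the
same address and port to saturate the bucket, then times one more bind()
against it. A sketch of the shape of the test (placeholder port and socket
count, not the final selftest; RLIMIT_NOFILE has to be raised above the
socket count):

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define NR_SOCKETS 24000	/* placeholder: the ~24k scenario */

static int bind_one(const struct sockaddr_in6 *addr)
{
	int one = 1;
	int fd = socket(AF_INET6, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	/* SO_REUSEADDR lets many non-listening sockets bind the same addr/port */
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	if (bind(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

int main(void)
{
	struct sockaddr_in6 addr = {
		.sin6_family = AF_INET6,
		.sin6_addr = IN6ADDR_LOOPBACK_INIT,
		.sin6_port = htons(12345),	/* placeholder port */
	};
	struct timespec start, end;
	int i;

	for (i = 0; i < NR_SOCKETS; i++)
		if (bind_one(&addr) < 0) {
			perror("bind");
			return 1;
		}

	/* time the bind that has to walk the saturated bucket */
	clock_gettime(CLOCK_MONOTONIC, &start);
	if (bind_one(&addr) < 0)
		perror("bind");
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("bind took %ld ns\n",
	       (end.tv_sec - start.tv_sec) * 1000000000L +
	       (end.tv_nsec - start.tv_nsec));
	return 0;
}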
>
> I mean, before today nobody was trying to have 24k listeners on a host,
> so it would be nice to have a regression test for future changes in the stack.
>
> If the goal is to deal with 400 vips, why use 24k in your changelog?
> I would rather stick to reality, and not pretend the TCP stack should
> scale to 24k listeners.
I chose 24k to test with because one internal team's use case is binding
~300 vips on the same port from 80 workers in parallel, which works out
to roughly 24k sockets bound to that port.
>
> I have not looked at the patch yet, I choked on the changelog for
> being exaggerated.
