Message-ID: <CANn89iKOkHHJ-papcMXJvq_8xSE2zXvqTfNSfGhq=Y1y_oKy6A@mail.gmail.com>
Date:   Fri, 22 Apr 2022 14:25:22 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     Joanne Koong <joannelkoong@...il.com>
Cc:     netdev <netdev@...r.kernel.org>, Martin KaFai Lau <kafai@...com>,
        David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>
Subject: Re: [net-next v1] net: Add a second bind table hashed by port and address

On Fri, Apr 22, 2022 at 2:07 PM Joanne Koong <joannelkoong@...il.com> wrote:
>
> On Thu, Apr 21, 2022 at 3:50 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Thu, Apr 21, 2022 at 3:16 PM Joanne Koong <joannelkoong@...il.com> wrote:
> > >
> > > We currently have one tcp bind table (bhash) which hashes by port
> > > number only. In the socket bind path, we check for bind conflicts by
> > > traversing the specified port's inet_bind_bucket while holding the
> > > bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
> > >
> > > When many sockets at different addresses are hashed to the same port,
> > > checking for a bind conflict is time-intensive and can cause softirq
> > > cpu lockups, and it also stalls new tcp connections since
> > > __inet_inherit_port() also contends for the spinlock.
> > >
> > > This patch proposes adding a second bind table, bhash2, that hashes by
> > > port and ip address. Searching the bhash2 table leads to significantly
> > > faster conflict resolution and less time holding the spinlock.
> > > In experiments on a local server, the time taken by a bind request
> > > was as follows:
> > >
> > > when there are ~24k sockets already bound to the port -
> > >
> > > ipv4:
> > > before - 0.002317 seconds
> > > with bhash2 - 0.000018 seconds
> > >
> > > ipv6:
> > > before - 0.002431 seconds
> > > with bhash2 - 0.000021 seconds
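
To make the bucket-spreading idea in the changelog above concrete, here is a
small standalone sketch in plain userspace C. It is not the kernel's actual
hashing code: the table size, the mixing constant and the function names are
purely illustrative.

/*
 * Standalone sketch (not kernel code): hashing by port alone puts every
 * socket bound to a given port into one bucket, while hashing by port
 * and address spreads them out.
 */
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 256u

/* bhash-style: all sockets bound to the same port share one bucket,
 * so a bind-conflict check has to walk all of them. */
static unsigned int hash_by_port(uint16_t port)
{
        return port % TABLE_SIZE;
}

/* bhash2-style: sockets bound to the same port but different local
 * addresses land in different buckets, keeping each bucket short. */
static unsigned int hash_by_port_and_addr(uint16_t port, uint32_t addr)
{
        return (port ^ (addr * 2654435761u)) % TABLE_SIZE;
}

int main(void)
{
        uint16_t port = 443;
        uint32_t addr_a = 0x0a000001;   /* 10.0.0.1 */
        uint32_t addr_b = 0x0a000002;   /* 10.0.0.2 */

        printf("hash by port:        %u vs %u\n",
               hash_by_port(port), hash_by_port(port));
        printf("hash by port + addr: %u vs %u\n",
               hash_by_port_and_addr(port, addr_a),
               hash_by_port_and_addr(port, addr_b));
        return 0;
}
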
> >
> >
> > Hi Joanne
> >
> > Do you have a test for this ? Are you using 24k IPv6 addresses on the host ?
> >
> > I fear we add some extra code and cost for quite an unusual configuration.
> >
> > Thanks.
> >
> Hi Eric,
>
> I have a test on my local server that populates the bhash table entry
> with 24k sockets for a given port and address, and then times how long
> a bind request on that port takes.

OK, but why 24k ? Why not 24 M then ?

In this case, will a 64K hash table be big enough ?

> When populating the table entry, I
> use the same IPv6 address on the host (with SO_REUSEADDR set). At
> Facebook, there are some internal teams that submit bind requests for
> 400 vips on the same port on concurrent threads that run into softirq
> lockup issues due to the bhash table entry spinlock contention, which
> is the main motivation behind this patch.
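
For illustration, a rough userspace sketch of that kind of micro-benchmark
could look like the following. This is not the actual test referenced above:
NSOCKS, the port and the loopback address are arbitrary placeholders, and the
process needs a raised file-descriptor limit (ulimit -n) to bind that many
sockets.

/*
 * Rough sketch: bind NSOCKS sockets with SO_REUSEADDR to one local IPv6
 * address and port, then time one more bind() against the now-long
 * bind bucket.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <time.h>

#define NSOCKS 24000
#define PORT   8000

static void bind_one(const struct sockaddr_in6 *sa)
{
        int one = 1;
        int fd = socket(AF_INET6, SOCK_STREAM, 0);

        if (fd < 0 ||
            setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0 ||
            bind(fd, (const struct sockaddr *)sa, sizeof(*sa)) < 0) {
                perror("bind_one");
                exit(1);
        }
}

int main(void)
{
        struct sockaddr_in6 sa = { .sin6_family = AF_INET6,
                                   .sin6_port   = htons(PORT) };
        struct timespec t0, t1;
        int i;

        inet_pton(AF_INET6, "::1", &sa.sin6_addr);

        /* populate the bind bucket for this port/address */
        for (i = 0; i < NSOCKS; i++)
                bind_one(&sa);

        /* time one extra bind, which must scan the bucket for conflicts */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        bind_one(&sa);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("bind took %.6f seconds\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
}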



I am pretty sure the IPv6 stack does not scale well if we have
thousands of IPv6 addresses on one netdev.
Some O(N) behavior will also trigger latency violations.

Can you share the test, in a form that can be added in linux tree ?

I mean, before today nobody was trying to have 24k listeners on a host,
so it would be nice to have a regression test for future changes in the stack.

If the goal is to deal with 400 vips, why use 24k in your changelog ?
I would rather stick to the reality, and not pretend TCP stack should
scale to 24k listeners.

I have not looked at the patch yet; I choked on the changelog because it
seems exaggerated.
