Date:   Tue, 28 Feb 2023 17:59:52 +0100
From:   Eric Dumazet <edumazet@...gle.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Linus Torvalds <torvalds@...uxfoundation.org>, x86@...nel.org,
        Wangyang Guo <wangyang.guo@...el.com>,
        Arjan van De Ven <arjan@...ux.intel.com>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Boqun Feng <boqun.feng@...il.com>,
        Mark Rutland <mark.rutland@....com>,
        Marc Zyngier <maz@...nel.org>
Subject: Re: [patch 0/3] net, refcount: Address dst_entry reference count
 scalability issues

On Tue, Feb 28, 2023 at 5:38 PM Thomas Gleixner <tglx@...utronix.de> wrote:
>
> Eric!
>
> On Tue, Feb 28 2023 at 16:07, Eric Dumazet wrote:
> > On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@...utronix.de> wrote:
> >>
> >> Hi!
> >>
> >> Wangyang and Arjan reported a bottleneck in the networking code related to
> >> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
> >> a dst_entry increases.
> >
> > We have per-cpu or per-tcp-socket dst though.
> >
> > Input path is RCU and does not touch dst refcnt.
> >
> > In real workloads (200Gbit NIC and above), we do not observe
> > contention on a dst refcnt.
> >
> > So it would be nice to know in which case you noticed some issues;
> > maybe there is something wrong in some layer.
>
> Two lines further down I explained which benchmark was used, no?
>
> >> This happens when there are a large number of connections to or from the
> >> same IP address. The memtier benchmark, when run on the same host as
> >> memcached, amplifies this massively. But even over real network connections
> >> this issue can be observed at an obviously smaller scale (due to the
> >> network bandwidth limitations in my setup, i.e. 1Gb).
> >>       atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
> >>       which exposes O(N^2) behaviour under contention with N concurrent
> >>       operations.
> >>
> >>       Lightweight instrumentation exposed an average of 8!! retry loops per
> >>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
> >>       running concurrently on 112 CPUs.
> >
> > User space benchmark <> kernel space.
>
> I know that. The point was to illustrate the non-scalability.
>
> > And we tend not to use 112 CPUs for kernel stack processing.
> >
> > Again, concurrent dst->refcnt changes are quite unlikely.
>
> So unlikely that they stand out in that particular benchmark.
>
> >> The overall gain of both changes for localhost memtier ranges from 1.2X to
> >> 3.2X, and from +2% to +5% for networked operations over a 1Gb connection.
> >>
> >> A micro benchmark which enforces maximized concurrency shows a gain between
> >> 1.2X and 4.7X!!!
> >
> > Can you elaborate on which networking benchmark you used,
> > and what actual gains you got?
>
> I'm happy to repeat here that it was memtier/memcached as I explained
> more than once in the cover letter.
>
> > In which path did access to dst->lwtstate prove to be a problem?
>
> ip_finish_output2()
>    if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read

This change alone should be easy to measure; could you please do this?

Oftentimes, moving a field looks sane, but the cache line access is
simply done later on the same path, for example when the refcnt is changed :)

Making dsts one cache line bigger has a performance impact.
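
To make that concrete, here is a minimal userspace sketch of the
false-sharing pattern (hypothetical layout and field names, NOT the real
struct dst_entry). Writers hammer a refcount that sits on the same cache
line as a read-mostly pointer, so every read misses; un-commenting the
padding splits the fields apart, which is roughly what moving a field buys
you, and only pays off if nothing else touches the hot line later on the
same path. Build with gcc -O2 -pthread and compare the two layouts under
perf stat or perf c2c:

/*
 * Minimal userspace sketch of the false-sharing point (hypothetical layout,
 * NOT the real struct dst_entry): a refcount written by several threads
 * shares a cache line with a read-mostly pointer that other threads read.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define LOOPS 10000000L

struct shared_line {
	void *read_mostly;	/* stands in for dst->lwtstate               */
	/* char pad[64]; */	/* un-comment to push refcnt to its own line */
	atomic_long refcnt;	/* stands in for dst->__refcnt               */
};

static struct shared_line line = { .read_mostly = &line };

static void *writer(void *arg)
{
	(void)arg;
	/* Each increment takes the line exclusive and invalidates the readers' copy. */
	for (long i = 0; i < LOOPS; i++)
		atomic_fetch_add(&line.refcnt, 1);
	return NULL;
}

static void *reader(void *arg)
{
	void *sink = NULL;

	(void)arg;
	/* Forced re-read of the pointer (like READ_ONCE()); it misses whenever a writer owns the line. */
	for (long i = 0; i < LOOPS; i++)
		sink = *(void * volatile *)&line.read_mostly;
	return sink;
}

int main(void)
{
	pthread_t w[4], r[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&w[i], NULL, writer, NULL);
	for (int i = 0; i < 4; i++)
		pthread_create(&r[i], NULL, reader, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(w[i], NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(r[i], NULL);

	printf("refcnt = %ld\n", (long)atomic_load(&line.refcnt));
	return 0;
}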

>
> > To me, this looks like someone wanted to push a new piece of infra
> > (include/linux/rcuref.h)
> > and decided that dst->refcnt would be a perfect place.
> >
> > Not the other way around (noticing there is an issue and enquiring with
> > networking folks about it before designing a solution).
>
> We looked at this because the reference count operations stood out in
> perf top and we analyzed it down to the false sharing _and_ the
> non-scalability of atomic_inc_not_zero().
>

Please share your recipe and perf results.

We must have been very lucky to not see this at Google.

tcp_rr workloads show dst_mtu() costs (60k GCU in the Google fleet),
which exceed the dst refcnt costs you are mentioning here.
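
For completeness, the kind of userspace inc()/dec() loop discussed above
can be sketched like this (an illustration of the atomic_inc_not_zero()
retry pattern only, not the kernel code and not the instrumentation Thomas
used; build with gcc -O2 -pthread). The average retry count it prints
should rise with the number of threads, which is the non-scalability in
question:

/*
 * Userspace sketch of an atomic_inc_not_zero()-style compare-exchange loop
 * with retry accounting (illustration only, not the kernel implementation).
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 16
#define ITERS    1000000L

static atomic_long refcnt = 1;	/* never dropped to zero in this sketch */
static atomic_long retries;

/* Increment *v unless it is zero; count how often the CAS has to be retried. */
static int inc_not_zero(atomic_long *v)
{
	long old = atomic_load_explicit(v, memory_order_relaxed);

	while (old != 0) {
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return 1;
		/* CAS failed: another thread changed the counter under us. */
		atomic_fetch_add_explicit(&retries, 1, memory_order_relaxed);
	}
	return 0;
}

static void *worker(void *arg)
{
	(void)arg;
	for (long i = 0; i < ITERS; i++)
		if (inc_not_zero(&refcnt))
			atomic_fetch_sub_explicit(&refcnt, 1, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);

	printf("avg retries per inc_not_zero(): %.2f\n",
	       (double)atomic_load(&retries) / (double)(NTHREADS * ITERS));
	return 0;
}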
