Date:   Tue, 28 Feb 2023 16:07:01 +0100
From:   Eric Dumazet <edumazet@...gle.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Linus Torvalds <torvalds@...uxfoundation.org>, x86@...nel.org,
        Wangyang Guo <wangyang.guo@...el.com>,
        Arjan van De Ven <arjan@...ux.intel.com>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Boqun Feng <boqun.feng@...il.com>,
        Mark Rutland <mark.rutland@....com>,
        Marc Zyngier <maz@...nel.org>
Subject: Re: [patch 0/3] net, refcount: Address dst_entry reference count
 scalability issues

On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@...utronix.de> wrote:
>
> Hi!
>
> Wangyang and Arjan reported a bottleneck in the networking code related to
> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
> a dst_entry increases.

We have per-cpu or per-tcp-socket dst though.

Input path is RCU and does not touch dst refcnt.

In real workloads (200Gbit NIC and above), we do not observe
contention on a dst refcnt.

So it would be nice to know in which cases you noticed these issues;
maybe there is something wrong in some layer.

>
> This happens when there are a large number of connections to or from the
> same IP address. The memtier benchmark when run on the same host as
> memcached amplifies this massively. But even over real network connections
> this issue can be observed at an obviously smaller scale (due to the
> network bandwidth limitations in my setup, i.e. 1Gb).
>
> There are two factors which make this reference count a scalability issue:
>
>    1) False sharing
>
>       dst_entry:__refcnt is located at offset 64 of dst_entry, which puts
>       it into a separate cacheline vs. the read mostly members located at
>       the beginning of the struct.
>
>       That prevents false sharing vs. the struct members in the first 64
>       bytes of the structure, but there is also
>
>             dst_entry::lwtstate
>
>       which is located after the reference count and in the same cache
>       line. This member is read after a reference count has been acquired.
>
>       The other problem is struct rtable, which embeds a struct dst_entry
>       at offset 0. struct dst_entry has a size of 112 bytes, which means
>       that the struct members of rtable which follow the dst member share
>       the same cache line as dst_entry::__refcnt. Especially
>
>           rtable::rt_genid
>
>       is also read by the contexts which have a reference count acquired
>       already.
>
>       When dst_entry:__refcnt is incremented or decremented via an atomic
>       operation these read accesses stall and contribute to the performance
>       problem.

In our kernel profiles, we never saw dst->refcnt changes being a
serious problem.

>
>    2) atomic_inc_not_zero()
>
>       A reference on dst_entry:__refcnt is acquired via
>       atomic_inc_not_zero() and released via atomic_dec_return().
>
>       atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
>       which exposes O(N^2) behaviour under contention with N concurrent
>       operations.
>
>       Lightweight instrumentation exposed an average of 8!! retry loops per
>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>       running concurrently on 112 CPUs.

A user-space benchmark is not the same as kernel space.
And we tend not to use 112 CPUs for kernel stack processing.

Again, concurrent dst->refcnt changes are quite unlikely.

>
>       There is nothing which can be done to make atomic_inc_not_zero() more
>       scalable.
>
> The following series addresses these issues:
>
>     1) Reorder and pad struct dst_entry to prevent the false sharing.
>
>     2) Implement and use a reference count implementation which avoids the
>        atomic_inc_not_zero() problem.
>
>        It is slightly less performant in the case of the final 1 -> 0
>        transition, but the deconstruction of these objects is a low
>        frequency event. get()/put() pairs are in the hotpath and that's
>        what this implementation optimizes for.
>
>        The algorithm of this reference count is only suitable for RCU
>        managed objects. Therefore it cannot replace the refcount_t
>        algorithm, which is also based on atomic_inc_not_zero(), due to a
>        subtle race condition related to the 1 -> 0 transition and the final
>        verdict to mark the reference count dead. See details in patch 2/3.
>
>        It might be just my lack of imagination which declares this to be
>        impossible and I'd be happy to be proven wrong.
>
>        As a bonus the new rcuref implementation provides underflow/overflow
>        detection and mitigation while being performance wise on par with
>        open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
>        the non-contended case.
>
> The combination of these two changes results in performance gains in micro
> benchmarks and also localhost and networked memtier benchmarks talking to
> memcached. It's hard to quantify the benchmark results as they depend
> heavily on the micro-architecture and the number of concurrent operations.
>
> The overall gain of both changes for localhost memtier ranges from 1.2X to
> 3.2X and from +2% to +5% for networked operations on a 1Gb connection.
>
> A micro benchmark which enforces maximized concurrency shows a gain between
> 1.2X and 4.7X!!!

Can you elaborate on what networking benchmark you used,
and what actual gains you got?

In which path did access to dst->lwtstate prove to be a problem?

>
> Obviously this is focussed on a particular problem and therefore needs to
> be discussed in detail. It also requires wider testing outside of the cases
> which this is focussed on.
>
> Though the false sharing issue is obvious and should be addressed
> independent of the more focussed reference count changes.
>
> The series is also available from git:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git
>
> I want to say thanks to Wangyang who analyzed the issue and provided
> the initial fix for the false sharing problem. Further thanks go to
> Arjan, Peter, Marc, Will and Borislav for valuable input and providing
> test results on machines which I do not have access to.
>
> Thoughts?

Initial feeling is that we need more details on the real workload.

Is it some degenerate case with one connected UDP socket used by
multiple threads?

To me, this looks like someone wanted to push a new piece of infra
(include/linux/rcuref.h)
and decided that dst->refcnt would be a perfect place,
rather than the other way around (noticing there is an issue, asking
networking folks about it, and then designing a solution).

Thanks



>
> Thanks,
>
>         tglx
> ---
>  include/linux/rcuref.h          |   89 +++++++++++
>  include/linux/types.h           |    6
>  include/net/dst.h               |   21 ++
>  include/net/sock.h              |    2
>  lib/Makefile                    |    2
>  lib/rcuref.c                    |  311 ++++++++++++++++++++++++++++++++++++++++
>  net/bridge/br_nf_core.c         |    2
>  net/core/dst.c                  |   26 ---
>  net/core/rtnetlink.c            |    2
>  net/ipv6/route.c                |    6
>  net/netfilter/ipvs/ip_vs_xmit.c |    4
>  11 files changed, 436 insertions(+), 35 deletions(-)
