lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87h6v5n3su.ffs@tglx>
Date:   Tue, 28 Feb 2023 17:38:25 +0100
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Linus Torvalds <torvalds@...uxfoundation.org>, x86@...nel.org,
        Wangyang Guo <wangyang.guo@...el.com>,
        Arjan van De Ven <arjan@...ux.intel.com>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Boqun Feng <boqun.feng@...il.com>,
        Mark Rutland <mark.rutland@....com>,
        Marc Zyngier <maz@...nel.org>
Subject: Re: [patch 0/3] net, refcount: Address dst_entry reference count
 scalability issues

Eric!

On Tue, Feb 28 2023 at 16:07, Eric Dumazet wrote:
> On Tue, Feb 28, 2023 at 3:33 PM Thomas Gleixner <tglx@...utronix.de> wrote:
>>
>> Hi!
>>
>> Wangyang and Arjan reported a bottleneck in the networking code related to
>> struct dst_entry::__refcnt. Performance tanks massively when concurrency on
>> a dst_entry increases.
>
> We have per-cpu or per-tcp-socket dst though.
>
> Input path is RCU and does not touch dst refcnt.
>
> In real workloads (200Gbit NIC and above), we do not observe
> contention on a dst refcnt.
>
> So it would be nice knowing in which case you noticed some issues,
> maybe there is something wrong in some layer.

Two lines further down I explained which benchmark was used, no?

>> This happens when there are a large amount of connections to or from the
>> same IP address. The memtier benchmark when run on the same host as
>> memcached amplifies this massively. But even over real network connections
>> this issue can be observed at an obviously smaller scale (due to the
>> network bandwith limitations in my setup, i.e. 1Gb).
>>       atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
>>       which exposes O(N^2) behaviour under contention with N concurrent
>>       operations.
>>
>>       Lightweight instrumentation exposed an average of 8!! retry loops per
>>       atomic_inc_not_zero() invocation in a userspace inc()/dec() loop
>>       running concurrently on 112 CPUs.
>
> User space benchmark <> kernel space.

I know that. The point was to illustrate the non-scalability.

> And we tend not using 112 cpus for kernel stack processing.
>
> Again, concurrent dst->refcnt changes are quite unlikely.

So unlikely that they stand out in that particular benchmark.

>> The overall gain of both changes for localhost memtier ranges from 1.2X to
>> 3.2X and from +2% to %5% range for networked operations on a 1Gb connection.
>>
>> A micro benchmark which enforces maximized concurrency shows a gain between
>> 1.2X and 4.7X!!!
>
> Can you elaborate on what networking benchmark you have used,
> and what actual gains you got ?

I'm happy to repeat here that it was memtier/memcached as I explained
more than once in the cover letter.

> In which path access to dst->lwtstate proved to be a problem ?

ip_finish_output2()
   if (lwtunnel_xmit_redirect(dst->lwtstate)) <- This read

> To me, this looks like someone wanted to push a new piece of infra
> (include/linux/rcuref.h)
> and decided that dst->refcnt would be a perfect place.
>
> Not the other way (noticing there is an issue, enquire networking
> folks about it, before designing a solution)

We looked at this because the reference count operations stood out in
perf top and we analyzed it down to the false sharing _and_ the
non-scalability of atomic_inc_not_zero().

That's what made me to think about a different implementation and yes,
the case at hand, which happens to be in the network code, allowed me to
verify that it actually works and scales better.

I'm terrible sorry, that I missed to first ask network people for
permission to think. Won't happen again.

Thanks,

        tglx

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ