Message-ID: <CAL+tcoBprZrKkfzCE6vdRVxMfZjN8dKNB+NRi37JjESx-dPABQ@mail.gmail.com>
Date: Wed, 3 Dec 2025 19:16:30 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Magnus Karlsson <magnus.karlsson@...il.com>
Cc: Paolo Abeni <pabeni@...hat.com>, davem@...emloft.net, edumazet@...gle.com, 
	kuba@...nel.org, bjorn@...nel.org, magnus.karlsson@...el.com, 
	maciej.fijalkowski@...el.com, jonathan.lemon@...il.com, sdf@...ichev.me, 
	ast@...nel.org, daniel@...earbox.net, hawk@...nel.org, 
	john.fastabend@...il.com, horms@...nel.org, andrew+netdev@...n.ch, 
	bpf@...r.kernel.org, netdev@...r.kernel.org, 
	Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next v3 2/3] xsk: use atomic operations around
 cached_prod for copy mode

On Wed, Dec 3, 2025 at 5:40 PM Magnus Karlsson
<magnus.karlsson@...il.com> wrote:
>
> On Wed, 3 Dec 2025 at 10:25, Paolo Abeni <pabeni@...hat.com> wrote:
> >
> > On 12/3/25 7:56 AM, Jason Xing wrote:
> > > On Sat, Nov 29, 2025 at 8:55 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > >> On Fri, Nov 28, 2025 at 10:20 PM Paolo Abeni <pabeni@...hat.com> wrote:
> > >>> On 11/28/25 2:46 PM, Jason Xing wrote:
> > >>>> From: Jason Xing <kernelxing@...cent.com>
> > >>>>
> > >>>> Use atomic_try_cmpxchg() to replace the spin lock. Technically,
> > >>>> CAS (Compare And Swap) is better than a coarse mechanism like a
> > >>>> spin lock, especially when we only need to perform a few simple
> > >>>> operations. A similar idea can be found in the recent commit
> > >>>> 100dfa74cad9 ("net: dev_queue_xmit() llist adoption"), which
> > >>>> implements lockless logic with the help of try_cmpxchg().
> > >>>>
> > >>>> Signed-off-by: Jason Xing <kernelxing@...cent.com>
> > >>>> ---
> > >>>> Paolo, sorry that I didn't try moving the lock to struct xsk_queue:
> > >>>> after investigation I reckon try_cmpxchg() adds less overhead when
> > >>>> multiple xsks contend at this point. So I hope this approach can
> > >>>> be adopted.
> > >>>
> > >>> I still think that moving the lock would be preferable, because it
> > >>> also makes sense from a maintenance perspective.
> > >>
> > >> I can see your point here. Sure, moving the lock is easier to
> > >> understand. But my take is that the atomic changes here are not that
> > >> hard to read :) They have the same effect as a spin lock because
> > >> try_cmpxchg() atomically checks, compares and sets the value.
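
To make that concrete: the reservation boils down to a CAS loop along
these lines (a simplified sketch, not the exact patch; it assumes
cached_prod becomes an atomic_t and abbreviates the free-space check):

static bool xsk_cq_cached_prod_reserve(struct xsk_queue *q)
{
	int prod = atomic_read(&q->cached_prod);

	do {
		/* Abbreviated: the real code also refreshes cached_cons
		 * from the shared ring before declaring the ring full.
		 */
		if ((u32)prod - q->cached_cons >= q->nentries)
			return false;
	} while (!atomic_try_cmpxchg(&q->cached_prod, &prod, prod + 1));

	return true;
}

Roughly speaking, xsk_cq_reserve_locked() performs the same check and
increment under spin_lock_irqsave()/spin_unlock_irqrestore().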
> > >>
> > >>> Can you report the difference you measure between atomics and
> > >>> moving the spin lock?
> > >>
> > >> No problem; hopefully I will give a detailed report next week,
> > >> because I'm going to apply it directly in production, where we have
> > >> multiple xsks sharing the same umem.
> > >
> > > I'm done with the test in production, where a few applications rely on
> > > multiple xsks sharing the same pool to send UDP packets. Here are the
> > > relevant numbers from the bcc funclatency tool, which recorded the
> > > latency of the two functions in question:
> > >
> > > 1. use spin lock
> > > $ sudo ./funclatency xsk_cq_reserve_locked
> > > Tracing 1 functions for "xsk_cq_reserve_locked"... Hit Ctrl-C to end.
> > > ^C
> > >      nsecs               : count     distribution
> > >          0 -> 1          : 0        |                                        |
> > >          2 -> 3          : 0        |                                        |
> > >          4 -> 7          : 0        |                                        |
> > >          8 -> 15         : 0        |                                        |
> > >         16 -> 31         : 0        |                                        |
> > >         32 -> 63         : 0        |                                        |
> > >         64 -> 127        : 0        |                                        |
> > >        128 -> 255        : 25308114 |**                                      |
> > >        256 -> 511        : 283924647 |**********************                  |
> > >        512 -> 1023       : 501589652 |****************************************|
> > >       1024 -> 2047       : 93045664 |*******                                 |
> > >       2048 -> 4095       : 746395   |                                        |
> > >       4096 -> 8191       : 424053   |                                        |
> > >       8192 -> 16383      : 1041     |                                        |
> > >      16384 -> 32767      : 0        |                                        |
> > >      32768 -> 65535      : 0        |                                        |
> > >      65536 -> 131071     : 0        |                                        |
> > >     131072 -> 262143     : 0        |                                        |
> > >     262144 -> 524287     : 0        |                                        |
> > >     524288 -> 1048575    : 6        |                                        |
> > >    1048576 -> 2097151    : 2        |                                        |
> > >
> > > avg = 664 nsecs, total: 601186432273 nsecs, count: 905039574
> > >
> > > 2. use atomic
> > > $ sudo ./funclatency xsk_cq_cached_prod_reserve
> > > Tracing 1 functions for "xsk_cq_cached_prod_reserve"... Hit Ctrl-C to end.
> > > ^C
> > >      nsecs               : count     distribution
> > >          0 -> 1          : 0        |                                        |
> > >          2 -> 3          : 0        |                                        |
> > >          4 -> 7          : 0        |                                        |
> > >          8 -> 15         : 0        |                                        |
> > >         16 -> 31         : 0        |                                        |
> > >         32 -> 63         : 0        |                                        |
> > >         64 -> 127        : 0        |                                        |
> > >        128 -> 255        : 109815401 |*********                               |
> > >        256 -> 511        : 485028947 |****************************************|
> > >        512 -> 1023       : 320121627 |**************************              |
> > >       1024 -> 2047       : 38538584 |***                                     |
> > >       2048 -> 4095       : 377026   |                                        |
> > >       4096 -> 8191       : 340961   |                                        |
> > >       8192 -> 16383      : 549      |                                        |
> > >      16384 -> 32767      : 0        |                                        |
> > >      32768 -> 65535      : 0        |                                        |
> > >      65536 -> 131071     : 0        |                                        |
> > >     131072 -> 262143     : 0        |                                        |
> > >     262144 -> 524287     : 0        |                                        |
> > >     524288 -> 1048575    : 10       |                                        |
> > >
> > > avg = 496 nsecs, total: 473682265261 nsecs, count: 954223105
> > >
> > > Those numbers were verified over and over again, so they are quite
> > > stable.
> > >
> > > You can see that with atomics the average is smaller and the count in
> > > the [128 -> 255] bucket is larger, which indicates better performance.
> > >
> > > I will add the above numbers to the commit log after the merge window opens.
> >
> > It's not just a matter of performance. Spinlocks additionally give you
> > fairness and lockdep guarantees, beyond being easier to grasp for
> > whoever is going to touch this code in the future, while raw atomics
> > give you none of that.

Right.

> >
> > From a maintainability perspective, spinlocks are much preferable.
> >
> > IMHO micro-benchmarking is not a strong enough argument to counter the
> > spinlock advantages: at the very _least_, a large performance gain should
> > be observed in relevant test cases and/or real live workloads.

The problem is that I have no good benchmark that can show the minor
improvement, because it requires multiple xsks. Xdpsock has bugs and
doesn't allow two xsks to run in parallel. Unlike the single-xsk
scenario, it's really difficult for me to measure the improvement at
the moment, so I resorted to observing latency.
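
(For reference, bcc's funclatency can also bound the collection window,
e.g. "sudo ./funclatency -d 60 xsk_cq_cached_prod_reserve" traces the
function for 60 seconds and then prints the histogram.)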

With that said, I will follow your suggestion to move that lock :)

> >
> > >>> Have you tried moving cq_prod_lock, too?
> > >>
> > >> Not yet, thanks for reminding me. It should not affect the sending
> > >> rate, only the tx completion time, I think.
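
For context, in copy mode the completion side currently publishes cq
entries under the pool lock, roughly like this (paraphrased from
net/xdp/xsk.c):

static void xsk_cq_submit_locked(struct xdp_sock *xs, u32 n)
{
	unsigned long flags;

	spin_lock_irqsave(&xs->pool->cq_lock, flags);
	xskq_prod_submit_n(xs->pool->cq, n);
	spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
}

So moving that lock would mostly show up in completion latency.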
> > >
> > > I also tried moving this lock, but sadly I noticed that at completion
> > > time taking the lock invalidated the cache line used by another thread
> > > sending packets. This shows up clearly in perf cycles:ppp:
> > > 1. before
> > > 8.70% xsk_cq_cached_prod_reserve
> > >
> > > 2. after
> > > 12.31% xsk_cq_cached_prod_reserve
> > >
> > > So I decided not to make that modification. Anyway, thanks for your
> > > valuable suggestions; I learned a lot from those interesting
> > > experiments.
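
(cycles:ppp requests zero-skid, maximally precise cycle samples; such
profiles can be collected with e.g. "sudo perf record -e cycles:ppp -a
-- sleep 10" followed by "sudo perf report".)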
> >
> > The goal of such a change would be to reduce the number of touched
> > cachelines;

Yep.

> > when I suggested the above, I did not dive into the producer
> > specifics, as I assumed the relevant producer data was inside the
> > xsk_queue struct.
> >
> > It looks like the data is actually inside 'struct xdp_ring', so the
> > producer lock should be moved there, specifically:
> >
> > struct xdp_ring {
> >         u32 producer ____cacheline_aligned_in_smp;
> >         spinlock_t producer_lock;
> >         /* ... */
> > };
>
> This struct is shared with user space, so we should not put the spin
> lock there.

Agree on this point.

> You could put it in struct xsk_queue, but perhaps you

My concern is that only the cq uses this lock, while the structure is
shared by all types of queues.

> would want to call it something more generic as there would be a lock
> present in all queues/rings, though you would only use it for the cq.
> Some of the members in xsk_queue are nearly always used when
> manipulating the ring, so the cache line should be hot.
>
> I am just thinking aloud about whether this would be correct. There is
> one pool per cq. When a pool is shared, the cq belonging to that pool is
> also always shared, so I think it would be correct to move the lock from
> the pool to the cq.

Exactly, that is what Paolo suggested previously.
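
Something like this, roughly (sketch only: the field name and placement
are illustrative, and the struct is abridged):

struct xsk_queue {
	u32 ring_mask;
	u32 nentries;
	u32 cached_prod;
	u32 cached_cons;
	struct xdp_ring *ring;
	spinlock_t prod_lock;	/* generic name; only taken for the cq */
	u64 invalid_descs;
	u64 queue_empty_descs;
};

The producer path would then take q->prod_lock on the cq instead of the
pool-level lock.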

>
> > I'm a bit lost in the struct indirection, but I think the above would
> > be beneficial even for the ZC path.

Spot on, that will be part of my future plan. Thanks!

Thanks,
Jason
