Message-ID: <CAJ8uoz2hjzea40-H3W5VAru7S+iFeVJ-2VoaFWjVqyFm3WpUKg@mail.gmail.com>
Date: Wed, 3 Dec 2025 10:40:36 +0100
From: Magnus Karlsson <magnus.karlsson@...il.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Jason Xing <kerneljasonxing@...il.com>, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, bjorn@...nel.org, magnus.karlsson@...el.com,
maciej.fijalkowski@...el.com, jonathan.lemon@...il.com, sdf@...ichev.me,
ast@...nel.org, daniel@...earbox.net, hawk@...nel.org,
john.fastabend@...il.com, horms@...nel.org, andrew+netdev@...n.ch,
bpf@...r.kernel.org, netdev@...r.kernel.org,
Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next v3 2/3] xsk: use atomic operations around
cached_prod for copy mode
On Wed, 3 Dec 2025 at 10:25, Paolo Abeni <pabeni@...hat.com> wrote:
>
> On 12/3/25 7:56 AM, Jason Xing wrote:
> > On Sat, Nov 29, 2025 at 8:55 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >> On Fri, Nov 28, 2025 at 10:20 PM Paolo Abeni <pabeni@...hat.com> wrote:
> >>> On 11/28/25 2:46 PM, Jason Xing wrote:
> >>>> From: Jason Xing <kernelxing@...cent.com>
> >>>>
> >>>> Use atomic_try_cmpxchg operations to replace the spin lock. Technically,
> >>>> CAS (Compare And Swap) is better than a coarser mechanism like a spin
> >>>> lock, especially when we only need to perform a few simple operations.
> >>>> A similar idea can be found in the recent commit 100dfa74cad9
> >>>> ("net: dev_queue_xmit() llist adoption"), which implements the lockless
> >>>> logic with the help of try_cmpxchg.
> >>>>
> >>>> Signed-off-by: Jason Xing <kernelxing@...cent.com>
> >>>> ---
> >>>> Paolo, sorry that I didn't try to move the lock to struct xsk_queue:
> >>>> after further investigation, I reckon try_cmpxchg adds less overhead
> >>>> when multiple xsks contend at this point. So I hope this approach
> >>>> can be adopted.
> >>>
> >>> I still think that moving the lock would be preferable, because it makes
> >>> sense also from a maintenance perspective.
> >>
> >> I can see your point here. Sure, moving the lock is easier to
> >> understand. But my take is that the atomic changes here are not that
> >> hard to read :) They have the same effect as a spin lock because
> >> try_cmpxchg() atomically checks, compares and sets the value.
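For concreteness, the try_cmpxchg() reservation presumably looks roughly
like the sketch below. This is only an illustration of the CAS loop; the
struct layout and the function name are guesses based on this thread,
not the actual patch.

struct xskq_sketch {
        atomic_t cached_prod;
        u32 cached_cons;
        u32 nentries;
};

/* Reserve one completion-queue slot by advancing cached_prod with a CAS
 * loop instead of taking a spin lock.
 */
static int xsk_cq_cached_prod_reserve_sketch(struct xskq_sketch *q)
{
        int prod = atomic_read(&q->cached_prod);

        do {
                /* Ring full: producer is a whole ring ahead of the consumer. */
                if ((u32)prod - q->cached_cons >= q->nentries)
                        return -ENOSPC;
                /* On CAS failure, prod is refreshed and the check is redone. */
        } while (!atomic_try_cmpxchg(&q->cached_prod, &prod, prod + 1));

        return 0;
}

The check-and-advance of cached_prod becomes a single atomic step, which
is what replaces the lock/unlock pair in the copy-mode path.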
> >>
> >>> Can you report the difference
> >>> you measure atomics vs moving the spin lock?
> >>
> >> No problem. Hopefully I will provide a detailed report next week
> >> because I'm going to apply it directly in production, where we have
> >> multiple xsks sharing the same umem.
> >
> > I'm done with the test in production, where a few applications rely on
> > multiple xsks sharing the same pool to send UDP packets. Here are the
> > most significant numbers from the bcc funclatency tool, which recorded
> > the latency of these particular functions:
> >
> > 1. use spin lock
> > $ sudo ./funclatency xsk_cq_reserve_locked
> > Tracing 1 functions for "xsk_cq_reserve_locked"... Hit Ctrl-C to end.
> > ^C
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 25308114 |** |
> > 256 -> 511 : 283924647 |********************** |
> > 512 -> 1023 : 501589652 |****************************************|
> > 1024 -> 2047 : 93045664 |******* |
> > 2048 -> 4095 : 746395 | |
> > 4096 -> 8191 : 424053 | |
> > 8192 -> 16383 : 1041 | |
> > 16384 -> 32767 : 0 | |
> > 32768 -> 65535 : 0 | |
> > 65536 -> 131071 : 0 | |
> > 131072 -> 262143 : 0 | |
> > 262144 -> 524287 : 0 | |
> > 524288 -> 1048575 : 6 | |
> > 1048576 -> 2097151 : 2 | |
> >
> > avg = 664 nsecs, total: 601186432273 nsecs, count: 905039574
> >
> > 2. use atomic
> > $ sudo ./funclatency xsk_cq_cached_prod_reserve
> > Tracing 1 functions for "xsk_cq_cached_prod_reserve"... Hit Ctrl-C to end.
> > ^C
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 109815401 |********* |
> > 256 -> 511 : 485028947 |****************************************|
> > 512 -> 1023 : 320121627 |************************** |
> > 1024 -> 2047 : 38538584 |*** |
> > 2048 -> 4095 : 377026 | |
> > 4096 -> 8191 : 340961 | |
> > 8192 -> 16383 : 549 | |
> > 16384 -> 32767 : 0 | |
> > 32768 -> 65535 : 0 | |
> > 65536 -> 131071 : 0 | |
> > 131072 -> 262143 : 0 | |
> > 262144 -> 524287 : 0 | |
> > 524288 -> 1048575 : 10 | |
> >
> > avg = 496 nsecs, total: 473682265261 nsecs, count: 954223105
> >
> > And those numbers were verified over and over again, which means they
> > are quite stable.
> >
> > You can see that when using atomics, the avg is smaller and the count
> > in the [128 -> 255] bucket is larger, which shows better performance.
> >
> > I will add the above numbers in the commit log after the merge window is open.
>
> It's not just a matter of performance. Spinlocks additionally give you
> fairness and lockdep guarantees, beyond being easier to grasp for
> whoever is going to touch this code in the future, while raw atomics
> give you none of that.
>
> From a maintainability perspective, spinlocks are much preferable.
>
> IMHO micro-benchmarking is not a strong enough argument to counter the
> spinlock advantages: at the very _least_, a large performance gain
> should be observed in relevant test cases and/or real-life workloads.
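For the readability comparison, the lock-based equivalent of the same
reservation would be roughly the following (again just an illustrative
sketch with made-up names; in the current code the lock actually lives
in the pool, as discussed further down, and the surrounding details
differ):

struct xskq_locked_sketch {
        u32 cached_prod;
        u32 cached_cons;
        u32 nentries;
        spinlock_t prod_lock;
};

static int xsk_cq_reserve_locked_sketch(struct xskq_locked_sketch *q)
{
        unsigned long flags;
        int ret = 0;

        spin_lock_irqsave(&q->prod_lock, flags);
        if (q->cached_prod - q->cached_cons >= q->nentries)
                ret = -ENOSPC;          /* ring full */
        else
                q->cached_prod++;
        spin_unlock_irqrestore(&q->prod_lock, flags);

        return ret;
}

Functionally both sketches do the same check-and-advance; the difference
is whether the serialization is an explicit lock or a CAS retry loop.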
>
> >>> Have you tried moving cq_prod_lock, too?
> >>
> >> Not yet, thanks for reminding me. It should not affect the sending
> >> rate, but rather the tx completion time, I think.
> >
> > I also tried moving this lock, but sadly I noticed that taking the
> > lock at completion time invalidated the cache line used by another
> > thread sending packets. This is clearly shown by perf cycles:ppp:
> > 1. before
> > 8.70% xsk_cq_cached_prod_reserve
> >
> > 2. after
> > 12.31% xsk_cq_cached_prod_reserve
> >
> > So I decided not to include such a modification. Anyway, thanks for
> > your valuable suggestions; I learnt a lot from those interesting
> > experiments.
>
> The goal of such a change would be to reduce the number of touched
> cachelines; when I suggested the above, I did not dive into the
> producer specifics and assumed the relevant producer data were inside
> the xsk_queue struct.
>
> It looks like the data is actually inside 'struct xdp_ring', so the
> producer lock should be moved there, specifically:
>
> struct xdp_ring {
> u32 producer ____cacheline_aligned_in_smp;
> spinlock_t producer_lock;
> // ...
This struct is reflected to user space, so we should not put the spin
lock there. But you could put it in struct xsk_queue, though perhaps
you would want to call it something more generic, since there would
then be a lock present in all queues/rings even though you would only
use it for the cq. Some of the members in xsk_queue are nearly always
used when manipulating the ring, so the cache line should be hot.
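Concretely, I am thinking of something along these lines (only a sketch;
the member list of struct xsk_queue is abbreviated here and the exact
layout may differ):

struct xsk_queue {
        u32 ring_mask;
        u32 nentries;
        u32 cached_prod;
        u32 cached_cons;
        struct xdp_ring *ring;
        spinlock_t prod_lock;   /* generic name; only the cq would use it today */
        u64 invalid_descs;
        u64 queue_empty_descs;
};

That keeps the lock out of the user-visible struct xdp_ring while still
placing it next to members that are touched on every ring operation.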
I am just thinking aloud about whether this would be correct. There is
one pool per cq. When a pool is shared, the cq belonging to that pool
is also always shared, so I think it would be correct to move the lock
from the pool to the cq.
> I'm a bit lost in the struct indirection, but I think the above would
> be beneficial even for the ZC path.
>
> /P
>
>