Message-ID: <CAJ8uoz2hjzea40-H3W5VAru7S+iFeVJ-2VoaFWjVqyFm3WpUKg@mail.gmail.com>
Date: Wed, 3 Dec 2025 10:40:36 +0100
From: Magnus Karlsson <magnus.karlsson@...il.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Jason Xing <kerneljasonxing@...il.com>, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, bjorn@...nel.org, magnus.karlsson@...el.com,
maciej.fijalkowski@...el.com, jonathan.lemon@...il.com, sdf@...ichev.me,
ast@...nel.org, daniel@...earbox.net, hawk@...nel.org,
john.fastabend@...il.com, horms@...nel.org, andrew+netdev@...n.ch,
bpf@...r.kernel.org, netdev@...r.kernel.org,
Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next v3 2/3] xsk: use atomic operations around
cached_prod for copy mode
On Wed, 3 Dec 2025 at 10:25, Paolo Abeni <pabeni@...hat.com> wrote:
>
> On 12/3/25 7:56 AM, Jason Xing wrote:
> > On Sat, Nov 29, 2025 at 8:55 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >> On Fri, Nov 28, 2025 at 10:20 PM Paolo Abeni <pabeni@...hat.com> wrote:
> >>> On 11/28/25 2:46 PM, Jason Xing wrote:
> >>>> From: Jason Xing <kernelxing@...cent.com>
> >>>>
> >>>> Use atomic_try_cmpxchg operations to replace the spin lock. Technically,
> >>>> CAS (Compare And Swap) is better than a coarser mechanism like a spin
> >>>> lock, especially when we only need to perform a few simple operations.
> >>>> A similar idea can be found in the recent commit 100dfa74cad9
> >>>> ("net: dev_queue_xmit() llist adoption"), which implements the lockless
> >>>> logic with the help of try_cmpxchg.
> >>>>
> >>>> Signed-off-by: Jason Xing <kernelxing@...cent.com>
> >>>> ---
> >>>> Paolo, sorry that I didn't try to move the lock to struct xsk_queue:
> >>>> after further investigation, I reckon try_cmpxchg adds less overhead
> >>>> when multiple xsks contend at this point. So I hope this approach
> >>>> can be adopted.
> >>>
> >>> I still think that moving the lock would be preferable, because it makes
> >>> sense also from a maintenance perspective.
> >>
> >> I can see your point here. Sure, moving the lock is easier to
> >> understand. But my take is that the atomic changes here are not that
> >> hard to read :) They have the same effect as a spin lock because
> >> try_cmpxchg() atomically checks, compares and sets the value.
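For concreteness, the try_cmpxchg() reservation presumably looks roughly
like the sketch below. This is only an illustration of the CAS loop; the
struct layout and the function name are guesses based on this thread,
not the actual patch.

struct xskq_sketch {
        atomic_t cached_prod;
        u32 cached_cons;
        u32 nentries;
};

/* Reserve one completion-queue slot by advancing cached_prod with a CAS
 * loop instead of taking a spin lock.
 */
static int xsk_cq_cached_prod_reserve_sketch(struct xskq_sketch *q)
{
        int prod = atomic_read(&q->cached_prod);

        do {
                /* Ring full: producer is a whole ring ahead of the consumer. */
                if ((u32)prod - q->cached_cons >= q->nentries)
                        return -ENOSPC;
                /* On CAS failure, prod is refreshed and the check is redone. */
        } while (!atomic_try_cmpxchg(&q->cached_prod, &prod, prod + 1));

        return 0;
}

The check-and-advance of cached_prod becomes a single atomic step, which
is what replaces the lock/unlock pair in the copy-mode path.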
> >>
> >>> Can you report the difference
> >>> you measure atomics vs moving the spin lock?
> >>
> >> No problem. Hopefully I will provide a detailed report next week
> >> because I'm going to apply it directly in production, where we have
> >> multiple xsks sharing the same umem.
> >
> > I'm done with the test in production, where a few applications rely on
> > multiple xsks sharing the same pool to send UDP packets. Here are the
> > most significant numbers from the bcc funclatency tool, which recorded
> > the latency of these particular functions:
> >
> > 1. use spin lock
> > $ sudo ./funclatency xsk_cq_reserve_locked
> > Tracing 1 functions for "xsk_cq_reserve_locked"... Hit Ctrl-C to end.
> > ^C
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 25308114 |** |
> > 256 -> 511 : 283924647 |********************** |
> > 512 -> 1023 : 501589652 |****************************************|
> > 1024 -> 2047 : 93045664 |******* |
> > 2048 -> 4095 : 746395 | |
> > 4096 -> 8191 : 424053 | |
> > 8192 -> 16383 : 1041 | |
> > 16384 -> 32767 : 0 | |
> > 32768 -> 65535 : 0 | |
> > 65536 -> 131071 : 0 | |
> > 131072 -> 262143 : 0 | |
> > 262144 -> 524287 : 0 | |
> > 524288 -> 1048575 : 6 | |
> > 1048576 -> 2097151 : 2 | |
> >
> > avg = 664 nsecs, total: 601186432273 nsecs, count: 905039574
> >
> > 2. use atomic
> > $ sudo ./funclatency xsk_cq_cached_prod_reserve
> > Tracing 1 functions for "xsk_cq_cached_prod_reserve"... Hit Ctrl-C to end.
> > ^C
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 109815401 |********* |
> > 256 -> 511 : 485028947 |****************************************|
> > 512 -> 1023 : 320121627 |************************** |
> > 1024 -> 2047 : 38538584 |*** |
> > 2048 -> 4095 : 377026 | |
> > 4096 -> 8191 : 340961 | |
> > 8192 -> 16383 : 549 | |
> > 16384 -> 32767 : 0 | |
> > 32768 -> 65535 : 0 | |
> > 65536 -> 131071 : 0 | |
> > 131072 -> 262143 : 0 | |
> > 262144 -> 524287 : 0 | |
> > 524288 -> 1048575 : 10 | |
> >
> > avg = 496 nsecs, total: 473682265261 nsecs, count: 954223105
> >
> > And those numbers were verified over and over again, which means they
> > are quite stable.
> >
> > You can see that when using atomics, the avg is smaller and the count
> > in the [128 -> 255] bucket is larger, which shows better performance.
> >
> > I will add the above numbers in the commit log after the merge window is open.
>
> It's not just a matter of performance. Spinlocks additionally give you
> fairness and lockdep guarantees, beyond being easier to grasp for
> whoever is going to touch this code in the future, while raw atomics
> give you none of that.
>
> From a maintainability perspective, spinlocks are much preferable.
>
> IMHO micro-benchmarking is not a strong enough argument to counter the
> spinlock advantages: at the very _least_, a large performance gain
> should be observed in relevant test cases and/or real-life workloads.
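For the readability comparison, the lock-based equivalent of the same
reservation would be roughly the following (again just an illustrative
sketch with made-up names; in the current code the lock actually lives
in the pool, as discussed further down, and the surrounding details
differ):

struct xskq_locked_sketch {
        u32 cached_prod;
        u32 cached_cons;
        u32 nentries;
        spinlock_t prod_lock;
};

static int xsk_cq_reserve_locked_sketch(struct xskq_locked_sketch *q)
{
        unsigned long flags;
        int ret = 0;

        spin_lock_irqsave(&q->prod_lock, flags);
        if (q->cached_prod - q->cached_cons >= q->nentries)
                ret = -ENOSPC;          /* ring full */
        else
                q->cached_prod++;
        spin_unlock_irqrestore(&q->prod_lock, flags);

        return ret;
}

Functionally both sketches do the same check-and-advance; the difference
is whether the serialization is an explicit lock or a CAS retry loop.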
>
> >>> Have you tried moving cq_prod_lock, too?
> >>
> >> Not yet, thanks for reminding me. It should not affect the sending
> >> rate, but rather the tx completion time, I think.
> >
> > I also tried moving this lock, but sadly I noticed that taking the
> > lock at completion time invalidated the cache line used by another
> > thread sending packets. This is clearly shown by perf cycles:ppp:
> > 1. before
> > 8.70% xsk_cq_cached_prod_reserve
> >
> > 2. after
> > 12.31% xsk_cq_cached_prod_reserve
> >
> > So I decided not to include such a modification. Anyway, thanks for
> > your valuable suggestions; I learnt a lot from those interesting
> > experiments.
>
> The goal of such a change would be to reduce the number of touched
> cachelines; when I suggested the above, I did not dive into the
> producer specifics and assumed the relevant producer data were inside
> the xsk_queue struct.
>
> It looks like the data is actually inside 'struct xdp_ring', so the
> producer lock should be moved there, specifically:
>
> struct xdp_ring {
> u32 producer ____cacheline_aligned_in_smp;
> spinlock_t producer_lock;
> // ...
This struct is reflected to user space, so we should not put the spin
lock there. But you could put it in struct xsk_queue, though perhaps
you would want to call it something more generic, since there would
then be a lock present in all queues/rings even though you would only
use it for the cq. Some of the members in xsk_queue are nearly always
used when manipulating the ring, so the cache line should be hot.
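Concretely, I am thinking of something along these lines (only a sketch;
the member list of struct xsk_queue is abbreviated here and the exact
layout may differ):

struct xsk_queue {
        u32 ring_mask;
        u32 nentries;
        u32 cached_prod;
        u32 cached_cons;
        struct xdp_ring *ring;
        spinlock_t prod_lock;   /* generic name; only the cq would use it today */
        u64 invalid_descs;
        u64 queue_empty_descs;
};

That keeps the lock out of the user-visible struct xdp_ring while still
placing it next to members that are touched on every ring operation.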
I am just thinking aloud about whether this would be correct. There is
one pool per cq. When a pool is shared, the cq belonging to that pool
is also always shared, so I think it would be correct to move the lock
from the pool to the cq.
> I'm a bit lost in the struct indirection, but I think the above would
> be beneficial even for the ZC path.
>
> /P
>
>