Message-ID: <CAL+tcoBMRdMevWCS1puVD4zEDt+69S6t2r6Ov8tw7zhgq_n=PA@mail.gmail.com>
Date: Wed, 3 Dec 2025 14:56:42 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
bjorn@...nel.org, magnus.karlsson@...el.com, maciej.fijalkowski@...el.com,
jonathan.lemon@...il.com, sdf@...ichev.me, ast@...nel.org,
daniel@...earbox.net, hawk@...nel.org, john.fastabend@...il.com,
horms@...nel.org, andrew+netdev@...n.ch, bpf@...r.kernel.org,
netdev@...r.kernel.org, Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next v3 2/3] xsk: use atomic operations around
cached_prod for copy mode
Hi Paolo,
On Sat, Nov 29, 2025 at 8:55 AM Jason Xing <kerneljasonxing@...il.com> wrote:
>
> On Fri, Nov 28, 2025 at 10:20 PM Paolo Abeni <pabeni@...hat.com> wrote:
> >
> > On 11/28/25 2:46 PM, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@...cent.com>
> > >
> > > Use atomic_try_cmpxchg operations to replace the spin lock.
> > > Technically, CAS (Compare And Swap) is better than a coarse-grained
> > > mechanism like a spin lock, especially when we only need to perform
> > > a few simple operations. A similar idea can be found in the recent
> > > commit 100dfa74cad9 ("net: dev_queue_xmit() llist adoption"), which
> > > implements the lockless logic with the help of try_cmpxchg.
> > >
> > > Signed-off-by: Jason Xing <kernelxing@...cent.com>
> > > ---
> > > Paolo, sorry that I didn't try moving the lock to struct xsk_queue:
> > > after investigation, I reckon try_cmpxchg adds less overhead when
> > > multiple xsks contend at this point, so I hope this approach can be
> > > adopted.
> >
> > I still think that moving the lock would be preferable, because it makes
> > sense also from a maintenance perspective.
>
> I can see your point here. Sure, moving the lock is easier to
> understand. But my take is that the atomic changes here are not that
> hard to read :) They have the same effect as the spin lock because
> try_cmpxchg() atomically checks, compares and sets.
>
> > Can you report the difference
> > you measure between atomics and moving the spin lock?
>
> No problem. Hopefully I can give a detailed report next week, because
> I'm going to apply it directly in production, where we have multiple
> xsks sharing the same umem.
I have finished the tests in production, where a few applications rely
on multiple xsks sharing the same pool to send UDP packets. Here are
the key numbers from the bcc funclatency tool, which recorded the
latency of the two functions in question:
1. With the spin lock:
$ sudo ./funclatency xsk_cq_reserve_locked
Tracing 1 functions for "xsk_cq_reserve_locked"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 25308114 |** |
256 -> 511 : 283924647 |********************** |
512 -> 1023 : 501589652 |****************************************|
1024 -> 2047 : 93045664 |******* |
2048 -> 4095 : 746395 | |
4096 -> 8191 : 424053 | |
8192 -> 16383 : 1041 | |
16384 -> 32767 : 0 | |
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 0 | |
262144 -> 524287 : 0 | |
524288 -> 1048575 : 6 | |
1048576 -> 2097151 : 2 | |
avg = 664 nsecs, total: 601186432273 nsecs, count: 905039574
2. With the atomic:
$ sudo ./funclatency xsk_cq_cached_prod_reserve
Tracing 1 functions for "xsk_cq_cached_prod_reserve"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 109815401 |********* |
256 -> 511 : 485028947 |****************************************|
512 -> 1023 : 320121627 |************************** |
1024 -> 2047 : 38538584 |*** |
2048 -> 4095 : 377026 | |
4096 -> 8191 : 340961 | |
8192 -> 16383 : 549 | |
16384 -> 32767 : 0 | |
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 0 | |
262144 -> 524287 : 0 | |
524288 -> 1048575 : 10 | |
avg = 496 nsecs, total: 473682265261 nsecs, count: 954223105
These numbers were reproduced consistently across repeated runs, so
they are quite stable.
You can see that with the atomic approach the average is lower and the
count in the [128 -> 255] bucket is higher, which indicates better
performance.
I will add the above numbers to the commit log after the merge window
opens.
>
> IMHO, in theory, atomics are way better than a spin lock in contended
> cases, since the protected region is small and fast.
I also spent time investigating the details of both approaches. The
spin lock itself uses something like atomic_try_cmpxchg on its fast
path and then falls back to a slow path under contention, which is
more complicated than a single atomic. And since the protected region
is small and only performs simple calculations, there is no need to
have one thread set a few things while the others wait.
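To make the comparison concrete, here is a minimal sketch of the two
reservation schemes as I understand them; the struct layout and
function names below are illustrative only, not the actual xsk code:

#include <linux/atomic.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/errno.h>

/* Illustrative only; not the real xsk_queue layout. */
struct cq_sketch {
	atomic_t cached_prod;		/* used by the CAS scheme */
	u32 cached_prod_locked;		/* used by the spinlock scheme */
	spinlock_t lock;
	u32 cached_cons;
	u32 size;
};

/* Spinlock scheme: every contender serializes on the coarse lock. */
static int reserve_locked(struct cq_sketch *q)
{
	int ret = 0;

	spin_lock(&q->lock);
	if (q->cached_prod_locked - q->cached_cons < q->size)
		q->cached_prod_locked++;	/* reserve one slot */
	else
		ret = -ENOSPC;			/* ring full */
	spin_unlock(&q->lock);

	return ret;
}

/*
 * CAS scheme: retry the increment until it lands or the ring is
 * full. On failure, atomic_try_cmpxchg() refreshes 'old', so the
 * fullness check is re-evaluated on every iteration.
 */
static int reserve_cas(struct cq_sketch *q)
{
	int old = atomic_read(&q->cached_prod);

	do {
		if ((u32)old - q->cached_cons >= q->size)
			return -ENOSPC;		/* ring full */
	} while (!atomic_try_cmpxchg(&q->cached_prod, &old, old + 1));

	return 0;
}

In the uncontended case, the CAS scheme is a single atomic
read-modify-write, while the spin lock needs a lock acquire and a
release around the same update.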
>
> >
> > Have you tried moving cq_prod_lock, too?
>
> Not yet, thanks for reminding me. It should not affect the sending
> rate, but rather the tx completion time, I think.
I also tried moving this lock, but sadly I noticed that taking the
lock on the completion path invalidated the cache line used by another
thread sending packets. This can be clearly seen with perf cycles:ppp:
1. Before moving the lock:
8.70% xsk_cq_cached_prod_reserve
2. After moving the lock:
12.31% xsk_cq_cached_prod_reserve
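(For reference, numbers like these can be collected with something
along the lines of the following; the exact options here are my
assumption, not necessarily what I ran:
$ sudo perf record -e cycles:ppp -a -- sleep 10
$ sudo perf report --stdio | grep xsk_cq_cached_prod_reserve
)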
So I decided not to include that modification. Anyway, thanks for your
valuable suggestions; I learnt a lot from these interesting
experiments.
Thanks,
Jason