Message-ID: <CAL+tcoC8v9QpTxRJWA17ciu=sB-RAZJ_eWNZZTVFYwUXEQHtbA@mail.gmail.com>
Date: Tue, 18 Nov 2025 10:06:27 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, 
	Kuniyuki Iwashima <kuniyu@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com, 
	Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Subject: Re: [PATCH v3 net-next 3/3] net: use napi_skb_cache even in process context

On Mon, Nov 17, 2025 at 10:31 PM Jason Xing <kerneljasonxing@...il.com> wrote:
>
> On Mon, Nov 17, 2025 at 5:48 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Mon, Nov 17, 2025 at 1:17 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > >
> > > On Mon, Nov 17, 2025 at 4:57 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > >
> > > > On Mon, Nov 17, 2025 at 12:41 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > >
> > > > > On Mon, Nov 17, 2025 at 9:07 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > >
> > > > > > On Mon, Nov 17, 2025 at 4:27 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > > >
> > > > > > > This is a followup of commit e20dfbad8aab ("net: fix napi_consume_skb()
> > > > > > > with alien skbs").
> > > > > > >
> > > > > > > Now that the per-cpu napi_skb_cache is populated from the TX
> > > > > > > completion path, we can make use of this cache, especially on
> > > > > > > cpus that are not used by a driver NAPI poll (the primary user
> > > > > > > of the napi_skb_cache).
> > > > > > >
> > > > > > > We can use the napi_skb_cache only if the current context is
> > > > > > > not hard irq.
> > > > > > >
> > > > > > > With this patch, I consistently reach 130 Mpps on my UDP tx stress test
> > > > > > > and reduce SLUB spinlock contention to smaller values.
> > > > > > >
> > > > > > > Note there is still some SLUB contention for skb->head allocations.
> > > > > > >
> > > > > > > I had to tune /sys/kernel/slab/skbuff_small_head/cpu_partial
> > > > > > > and /sys/kernel/slab/skbuff_small_head/min_partial depending
> > > > > > > on the platform taxonomy.
> > > > > > >
> > > > > > > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
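
[ For illustration, a minimal sketch of the idea described in the commit
  message above; the helper name and the exact fallback below are made
  up and are not the actual patch: ]

	/* Use the per-cpu napi_skb_cache unless we are in hard IRQ
	 * context, where the cache cannot be touched safely; otherwise
	 * fall back to the regular slab allocation.
	 */
	static struct sk_buff *skb_from_cache_or_slab(gfp_t gfp_mask, int node)
	{
		struct sk_buff *skb = NULL;

		if (!in_hardirq())
			skb = napi_skb_cache_get();
		if (!skb)
			skb = kmem_cache_alloc_node(net_hotdata.skbuff_cache,
						    gfp_mask & ~GFP_DMA, node);
		return skb;
	}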
> > > > > >
> > > > > > Reviewed-by: Jason Xing <kerneljasonxing@...il.com>
> > > > > >
> > > > > > Thanks for working on this. Previously I was thinking about this as
> > > > > > well, since it affects the hot path for xsk (please see
> > > > > > __xsk_generic_xmit()->xsk_build_skb()->sock_alloc_send_pskb()). But I
> > > > > > wasn't aware of the trade-off between disabling irqs and the way
> > > > > > memory is allocated. AFAIK, I once removed an irq enable/disable
> > > > > > pair and saw a minor improvement, as this commit [1] says. Would
> > > > > > you share your invaluable experience with us in this case?
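
[ For what it's worth, the cost in question can be pictured with a
  made-up example: a per-cpu free list that must also be safe against
  hard IRQ users needs an irq disable/enable pair around every access,
  which is exactly what one wants to avoid on a hot TX path. Nothing
  below is kernel code: ]

	struct skb_freelist {
		unsigned int	count;
		struct sk_buff	*cache[64];
	};

	static struct sk_buff *freelist_pop(struct skb_freelist *fl)
	{
		struct sk_buff *skb = NULL;
		unsigned long flags;

		local_irq_save(flags);		/* keep hard IRQ users out */
		if (fl->count)
			skb = fl->cache[--fl->count];
		local_irq_restore(flags);

		return skb;
	}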
> > > > > >
> > > > > > In the meantime, I will do more rounds of experiments to see how they perform.
> > > > >
> > > > > Tested-by: Jason Xing <kerneljasonxing@...il.com>
> > > > > Done! I managed to see an improvement: the pps number goes from
> > > > > 1,458,644 to 1,647,235 when running [2].
> > > > >
> > > > > But the sad news is that the earlier commit [3] leads to a huge drop
> > > > > in af_xdp performance, from 1,980,000 to 1,458,644. With commit [3]
>
> I have to rephrase a bit. What I did: on top of the commit just
> before [3], I ran the tests with and without the [3] series, and the
> numbers were 1,458,644 and 1,980,000 respectively. That means
> reverting commit [3] brought back that much performance.
>
> Furthermore, I did a few more tests (my env changed):
> 1) On the latest kernel, the number is around 1,680,000; after
> reverting commit [3], it increases to 1,850,000.
> 2) With the current series applied, the number was around 1,730,000.
> 3) With the current series applied and commit [3] reverted, the
> number was still around 1,730,000.
>
> That means the current series negates the effect of that commit [3]
> (reverting it no longer makes a difference). I've done many rounds of
> tests, so the above numbers are quite stable. I feel I may need to
> directly use the fallback memory allocation instead of trying
> napi_skb_cache_get() for xsk? I'm not sure there is any theory that
> says which one is better. From my instinct, trying disabling/enabling
> is saving time while using napi_skb_cache_get() is also saving time...
>
> > > > > applied, I observed that xdpsock always allocated the skb on cpu
> > > > > 0, but the napi poll triggered skb_attempt_defer_free() on another
> > > > > cpu [4], which affected the final results.
> > > > >
> > > > > [2]
> > > > > taskset -c 0 ./xdpsock -i enp2s0f1 -q 1 -t -S -s 64
> > > > >
> > > > > [3]
> > > > > commit e20dfbad8aab2b7c72571ae3c3e2e646d6b04cb7
> > > > > Author: Eric Dumazet <edumazet@...gle.com>
> > > > > Date:   Thu Nov 6 20:29:34 2025 +0000
> > > > >
> > > > >     net: fix napi_consume_skb() with alien skbs
> > > > >
> > > > >     There is a lack of NUMA awareness and more generally lack
> > > > >     of slab caches affinity on TX completion path.
> > > > >
> > > > > [4]
> > > > > @c[
> > > > >     skb_attempt_defer_free+1
> > > > >     ixgbe_clean_tx_irq+723
> > > > >     ixgbe_poll+119
> > > > >     __napi_poll+48
> > > > > , ksoftirqd/24]: 1964731
> > > > >
> > > > > @c[
> > > > >     kick_defer_list_purge+1
> > > > >     napi_consume_skb+333
> > > > >     ixgbe_clean_tx_irq+723
> > > > >     ixgbe_poll+119
> > > > > , 34, swapper/34]: 123779
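
[ A conceptual sketch of the deferral seen in the two stacks above; the
  real logic lives in skb_attempt_defer_free() / kick_defer_list_purge(),
  and the helper name below is made up: ]

	/* An skb freed on a CPU other than the one recorded in
	 * skb->alloc_cpu is queued back to that CPU's defer list
	 * (flushed via an IPI/kick) instead of being freed into a
	 * remote slab cache.
	 */
	static void defer_free_sketch(struct sk_buff *skb)
	{
		if (skb->alloc_cpu == raw_smp_processor_id()) {
			__kfree_skb(skb);	/* local: free right away */
			return;
		}
		skb_attempt_defer_free(skb);	/* remote: hand back to alloc_cpu */
	}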
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > >
> > > > Hi Jason.
> > > >
> > > > It is a bit hard to guess without more details (cpu you are using),
> > > > and perhaps perf profiles.
> > >
> > > xdpsock only calculates the rate on the cpu where it sends packets.
> > > In more detail, it checks whether the packets have been sent by
> > > inspecting the completion queue and then sends another batch of
> > > packets, over and over again.
> > >
> > > My test env is relatively old:
> > > [root@...alhost ~]# lscpu
> > > Architecture:                x86_64
> > >   CPU op-mode(s):            32-bit, 64-bit
> > >   Address sizes:             46 bits physical, 48 bits virtual
> > >   Byte Order:                Little Endian
> > > CPU(s):                      48
> > >   On-line CPU(s) list:       0-47
> > > Vendor ID:                   GenuineIntel
> > >   BIOS Vendor ID:            Intel(R) Corporation
> > >   Model name:                Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
> > >     BIOS Model name:         Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz CPU @ 2.3GHz
> > >     BIOS CPU family:         179
> > >     CPU family:              6
> > >     Model:                   63
> > >     Thread(s) per core:      2
> > >     Core(s) per socket:      12
> > >     Socket(s):               2
> > >     Stepping:                2
> > >     CPU(s) scaling MHz:      84%
> > >     CPU max MHz:             3100.0000
> > >     CPU min MHz:             1200.0000
> > >
> > > After that commit [3], the perf profile differs because interrupts
> > > frequently jump into the tx path:
> > > -   98.72%     0.09%  xdpsock          libc.so.6          [.] __libc_sendto
> > >    - __libc_sendto
> > >       - 98.28% entry_SYSCALL_64_after_hwframe
> > >          - 98.19% do_syscall_64
> > >             - 97.94% x64_sys_call
> > >                - 97.91% __x64_sys_sendto
> > >                   - 97.80% __sys_sendto
> > >                      - 97.28% xsk_sendmsg
> > >                         - 97.18% __xsk_sendmsg.constprop.0.isra.0
> > >                            - 87.85% __xsk_generic_xmit
> > >                               - 41.71% xsk_build_skb
> > >                                  - 33.06% sock_alloc_send_pskb
> > >                                     + 22.06% alloc_skb_with_frags
> > >                                     + 7.61% skb_set_owner_w
> > >                                     + 0.58% asm_sysvec_call_function_single
> > >                                    1.66% skb_store_bits
> > >                                    1.60% memcpy_orig
> > >                                    0.69% xp_raw_get_data
> > >                               - 32.56% __dev_direct_xmit
> > >                                  + 15.62% 0xffffffffa064e5ac
> > >                                  - 7.33% __local_bh_enable_ip
> > >                                     + 6.18% do_softirq
> > >                                     + 0.70% asm_sysvec_call_function_single
> > >                                  + 5.78% validate_xmit_skb
> > >                               - 6.08% asm_sysvec_call_function_single
> > >                                  + sysvec_call_function_single
> > >                               + 1.62% xp_raw_get_data
> > >                            - 8.29% _raw_spin_lock
> > >                               + 3.01% asm_sysvec_call_function_single
> > >
> > > Prior to that patch, I didn't see any interrupts coming in, so I
> > > assume the defer part caused the problem.
> >
> > This is a trade-off.
> >
> > If you are using a single cpu to send packets, then there will not
> > really be contention on SLUB structures.
> >
> > A bit like RFS: if you want to reach max throughput on a single flow,
> > it is probably better _not_ to use RFS/RPS, so that more than one cpu
> > can be involved in the rx work.
>
> Point taken!
>
> >
> > We can add a static key to enable/disable the behaviors that are most
> > suited to a particular workload.
>
> Are we going to introduce a new knob to control this snippet in
> napi_consume_skb()?

That's it. For single-flow xsk scenarios, adding something like a
static key to avoid 1) using napi_skb_cache_get() as in this patch, and
2) the deferred free from commit [3], can bring the performance back up
to more than 1,900,000 pps. I have no clue whether adding a new sysctl
is acceptable.
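
Just to make it concrete, something along these lines (all names are
made up, and I am not sure a sysctl flipping a static key is the right
interface):

	DEFINE_STATIC_KEY_TRUE(net_tx_defer_free_key);

	/* TX completion path: when the (hypothetical) key is disabled,
	 * free the skb locally instead of deferring it back to the
	 * allocating CPU, which is what the single-flow xsk case wants.
	 */
	static void tx_completion_consume(struct sk_buff *skb, int budget)
	{
		if (static_branch_likely(&net_tx_defer_free_key))
			napi_consume_skb(skb, budget);	/* may defer to skb->alloc_cpu */
		else
			dev_consume_skb_any(skb);	/* free right here */
	}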

Thanks,
Jason
