netdev - Re: [PATCH v3 net-next 3/3] net: use napi_skb

Message-ID: <CAL+tcoAJR3Du1ZsJC5KU=pNB7G9FP+qYVe8314GXu8xv7-PC3g@mail.gmail.com>
Date: Mon, 17 Nov 2025 22:31:36 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, 
	Kuniyuki Iwashima <kuniyu@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com, 
	Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Subject: Re: [PATCH v3 net-next 3/3] net: use napi_skb_cache even in process context

On Mon, Nov 17, 2025 at 5:48 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Mon, Nov 17, 2025 at 1:17 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > On Mon, Nov 17, 2025 at 4:57 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Mon, Nov 17, 2025 at 12:41 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > On Mon, Nov 17, 2025 at 9:07 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > >
> > > > > On Mon, Nov 17, 2025 at 4:27 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > >
> > > > > > This is a followup of commit e20dfbad8aab ("net: fix napi_consume_skb()
> > > > > > with alien skbs").
> > > > > >
> > > > > > Now the per-cpu napi_skb_cache is populated from TX completion path,
> > > > > > we can make use of this cache, especially for cpus not used
> > > > > > from a driver NAPI poll (primary user of napi_cache).
> > > > > >
> > > > > > We can use the napi_skb_cache only if current context is not from hard irq.
> > > > > >
> > > > > > With this patch, I consistently reach 130 Mpps on my UDP tx stress test
> > > > > > and reduce SLUB spinlock contention to smaller values.
> > > > > >
> > > > > > Note there is still some SLUB contention for skb->head allocations.
> > > > > >
> > > > > > I had to tune /sys/kernel/slab/skbuff_small_head/cpu_partial
> > > > > > and /sys/kernel/slab/skbuff_small_head/min_partial depending
> > > > > > on the platform taxonomy.
> > > > > >
> > > > > > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > > > >
> > > > > Reviewed-by: Jason Xing <kerneljasonxing@...il.com>
> > > > >
> > > > > Thanks for working on this. Previously I was thinking about this as
> > > > > well since it affects the hot path for xsk (please see
> > > > > __xsk_generic_xmit()->xsk_build_skb()->sock_alloc_send_pskb()). But I
> > > > > wasn't aware of the benefits between disabling irq and allocating
> > > > > memory. AFAIK, I once removed an enabling/disabling irq pair and saw a
> > > > > minor improvement as this commit[1] says. Would you share your
> > > > > invaluable experience with us in this case?
> > > > >
> > > > > In the meantime, I will do more rounds of experiments to see how they perform.
> > > >
> > > > Tested-by: Jason Xing <kerneljasonxing@...il.com>
> > > > Done! I managed to see an improvement. The pps number goes from
> > > > 1,458,644 to 1,647,235 by running [2].
> > > >
> > > > But sadly the news is that the previous commit [3] leads to a huge
> > > > decrease in af_xdp from 1,980,000 to 1,458,644. With commit [3]

Let me rephrase. I ran the tests on top of the commit just before [3],
i.e. without the [3] series, to compare against the run with [3]. The
numbers were 1,458,644 (with [3]) and 1,980,000 (without [3]). That
means reverting commit [3] brought back that much performance.

Furthermore, I did a few more tests, and note that my env has changed:
1) On the latest kernel, the number is around 1,680,000. With commit
[3] reverted, it increases to 1,850,000.
2) With the current series applied, the number is around 1,730,000.
3) With the current series applied and [3] reverted, the number is
still around 1,730,000.

That means the current series negates the benefits of that commit [3].
I've done many rounds of tests, so the above numbers are quite stable.
Should I directly use the fallback memory allocation for xsk instead
of trying napi_skb_cache_get()? I'm not sure whether there is any
theory that tells which one is better. My instinct is that avoiding
the disable/enable pair saves time, while using napi_skb_cache_get()
also saves time...
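
To put the numbers above side by side, here is a small sketch that
only does the arithmetic on the pps figures quoted in this mail. The
helper name pct_change() and the run labels are made up for
illustration; nothing here is a new measurement.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# pps figures quoted in this mail; the labels are illustrative only.
runs = [
    ("old env: with [3] -> without [3]", 1_458_644, 1_980_000),
    ("new env: latest -> latest with [3] reverted", 1_680_000, 1_850_000),
    ("new env: latest -> latest plus this series", 1_680_000, 1_730_000),
    ("new env: series -> series with [3] reverted", 1_730_000, 1_730_000),
]

for label, before, after in runs:
    print(f"{label}: {pct_change(before, after):+.1f}%")
```

The old-env and new-env rows are not directly comparable with each
other since the environment changed between them.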

> > > > applied, I observed and found xdpsock always allocated the skb on cpu
> > > > 0 but the napi poll triggered skb_attempt_defer_free() on another
> > > > cpu [4], which affected the final results.
> > > >
> > > > [2]
> > > > taskset -c 0 ./xdpsock -i enp2s0f1 -q 1 -t -S -s 64
> > > >
> > > > [3]
> > > > commit e20dfbad8aab2b7c72571ae3c3e2e646d6b04cb7
> > > > Author: Eric Dumazet <edumazet@...gle.com>
> > > > Date:   Thu Nov 6 20:29:34 2025 +0000
> > > >
> > > >     net: fix napi_consume_skb() with alien skbs
> > > >
> > > >     There is a lack of NUMA awareness and more generally lack
> > > >     of slab caches affinity on TX completion path.
> > > >
> > > > [4]
> > > > @c[
> > > >     skb_attempt_defer_free+1
> > > >     ixgbe_clean_tx_irq+723
> > > >     ixgbe_poll+119
> > > >     __napi_poll+48
> > > > , ksoftirqd/24]: 1964731
> > > >
> > > > @c[
> > > >     kick_defer_list_purge+1
> > > >     napi_consume_skb+333
> > > >     ixgbe_clean_tx_irq+723
> > > >     ixgbe_poll+119
> > > > , 34, swapper/34]: 123779
> > > >
> > > > Thanks,
> > > > Jason
> > >
> > > Hi Jason.
> > >
> > > It is a bit hard to guess without more details (cpu you are using),
> > > and perhaps perf profiles.
> >
> > Xdpsock only calculates the speed on the cpu where it sends packets.
> > In more detail, it checks whether the packets have been sent by
> > inspecting the completion queue, then sends another batch of packets,
> > over and over again.
> >
> > My test env is relatively old:
> > [root@...alhost ~]# lscpu
> > Architecture:                x86_64
> >   CPU op-mode(s):            32-bit, 64-bit
> >   Address sizes:             46 bits physical, 48 bits virtual
> >   Byte Order:                Little Endian
> > CPU(s):                      48
> >   On-line CPU(s) list:       0-47
> > Vendor ID:                   GenuineIntel
> >   BIOS Vendor ID:            Intel(R) Corporation
> >   Model name:                Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
> >     BIOS Model name:         Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz  CPU @ 2.3GHz
> >     BIOS CPU family:         179
> >     CPU family:              6
> >     Model:                   63
> >     Thread(s) per core:      2
> >     Core(s) per socket:      12
> >     Socket(s):               2
> >     Stepping:                2
> >     CPU(s) scaling MHz:      84%
> >     CPU max MHz:             3100.0000
> >     CPU min MHz:             1200.0000
> >
> > After that commit [3], the perf differs because of the interrupts
> > jumping in the tx path frequently:
> > -   98.72%     0.09%  xdpsock          libc.so.6          [.] __libc_sendto
> >    - __libc_sendto
> >       - 98.28% entry_SYSCALL_64_after_hwframe
> >          - 98.19% do_syscall_64
> >             - 97.94% x64_sys_call
> >                - 97.91% __x64_sys_sendto
> >                   - 97.80% __sys_sendto
> >                      - 97.28% xsk_sendmsg
> >                         - 97.18% __xsk_sendmsg.constprop.0.isra.0
> >                            - 87.85% __xsk_generic_xmit
> >                               - 41.71% xsk_build_skb
> >                                  - 33.06% sock_alloc_send_pskb
> >                                     + 22.06% alloc_skb_with_frags
> >                                     + 7.61% skb_set_owner_w
> >                                     + 0.58% asm_sysvec_call_function_single
> >                                    1.66% skb_store_bits
> >                                    1.60% memcpy_orig
> >                                    0.69% xp_raw_get_data
> >                               - 32.56% __dev_direct_xmit
> >                                  + 15.62% 0xffffffffa064e5ac
> >                                  - 7.33% __local_bh_enable_ip
> >                                     + 6.18% do_softirq
> >                                     + 0.70% asm_sysvec_call_function_single
> >                                  + 5.78% validate_xmit_skb
> >                               - 6.08% asm_sysvec_call_function_single
> >                                  + sysvec_call_function_single
> >                               + 1.62% xp_raw_get_data
> >                            - 8.29% _raw_spin_lock
> >                               + 3.01% asm_sysvec_call_function_single
> >
> > Prior to that patch, I didn't see any interrupts coming in, so I
> > assume the defer part caused the problem.
>
> This is a trade off.
>
> If you are using a single cpu to send packets, then it will not really
> have contention on SLUB structures.
>
> A bit like RFS : If you want to reach max throughput on a single flow,
> it is probably better _not_ using RFS/RPS,
> so that more than one cpu can be involved in the rx work.

Point taken!

>
> We can add a static key to enable/disable the behaviors that are most
> suited to a particular workload.

Are we going to introduce a new knob to control this snippet in
napi_consume_skb()?

>
> We could also call skb_defer_free_flush() (now it is IRQ safe) from
> napi_skb_cache_get() before
> we attempt to allocate new sk_buff. This would prevent IPI from being sent.

Well, it could, but it adds more complexity to the allocation phase. I
would prefer the static key approach :)
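
For what it's worth, the flush-before-allocate idea can be illustrated
with a toy userspace model. This is only a sketch under loud
assumptions: the Cpu class, the defer_max value of 8, the rule "kick an
IPI when the deferred list reaches half of defer_max" and the fixed
burst of 2 are all hypothetical simplifications, not the kernel's
actual data structures or thresholds. It only shows why draining the
deferred list from the allocation path can keep that list below the
level where an IPI would be sent.

```python
from collections import deque


class Cpu:
    """Toy stand-in for per-CPU state (hypothetical, not softnet_data)."""

    def __init__(self, defer_max: int = 8):  # 8 is an arbitrary toy value
        self.cache = deque()       # models the per-cpu napi_skb_cache
        self.defer_list = deque()  # skbs freed by another CPU, not yet recycled
        self.defer_max = defer_max
        self.ipis = 0              # IPIs sent to this CPU
        self.slab_allocs = 0       # allocations that missed the cache

    def flush(self) -> None:
        """Models skb_defer_free_flush(): recycle deferred skbs locally."""
        while self.defer_list:
            self.cache.append(self.defer_list.popleft())

    def remote_free(self, skb: object) -> None:
        """Models skb_attempt_defer_free() called from another CPU."""
        self.defer_list.append(skb)
        # Assumed rule for this toy: kick once the list is half full.
        if len(self.defer_list) == self.defer_max // 2:
            self.ipis += 1  # models kick_defer_list_purge()
            self.flush()    # the IPI handler drains the list

    def alloc(self, flush_first: bool) -> object:
        """Models the allocation path, optionally draining the list first."""
        if flush_first:
            self.flush()
        if self.cache:
            return self.cache.popleft()
        self.slab_allocs += 1  # fallback allocation
        return object()


def run(flush_first: bool, rounds: int = 1000, burst: int = 2):
    """Send `rounds` bursts; every skb is freed by a remote CPU."""
    cpu = Cpu()
    for _ in range(rounds):
        batch = [cpu.alloc(flush_first) for _ in range(burst)]
        for skb in batch:
            cpu.remote_free(skb)
    return cpu.ipis, cpu.slab_allocs


print("no flush on alloc:", run(False))
print("flush on alloc:   ", run(True))
```

In this model the variant that flushes on allocation never sends an
IPI, because the burst is smaller than the kick threshold, but it pays
for an extra check on every allocation, which is the added complexity
in the allocation phase.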

Thanks,
Jason

>
> >
> > > In particular which cpu is the bottleneck ?
> > >
> > > 1) There is still the missing part about tuning NAPI_SKB_CACHE_SIZE /
> > > NAPI_SKB_CACHE_BULK, I was hoping you could send the patch we
> > > discussed earlier ?
> >
> > Sure thing :) I've done that part locally and plan to post it as
> > soon as the current series gets merged.
> >
> > >
> > > 2) I am also working on allowing batches of skbs for skb_attempt_defer_free().
> > >
> > > Another item I am working on is to let the qdisc be serviced
> > > preferably not by the cpu performing TX completion,
> > > I mentioned making qdisc->running a sequence that we can latch
> > > in __netif_schedule().
> > > (Idea is to be able to not spin on qdisc spinlock from net_tx_action()
> > > if another cpu was able to call qdisc_run())
> >
> > Awesome work. Hope to see it very soon :)
> >
> > Thanks,
> > Jason