Message-ID: <CANn89iJec+7ssKpAp0Um5pNecfzxohRJBKQybSYS=e-9pQjqag@mail.gmail.com>
Date: Mon, 17 Nov 2025 01:48:06 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Jason Xing <kerneljasonxing@...il.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com,
Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Subject: Re: [PATCH v3 net-next 3/3] net: use napi_skb_cache even in process context
On Mon, Nov 17, 2025 at 1:17 AM Jason Xing <kerneljasonxing@...il.com> wrote:
>
> On Mon, Nov 17, 2025 at 4:57 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Mon, Nov 17, 2025 at 12:41 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > >
> > > On Mon, Nov 17, 2025 at 9:07 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > On Mon, Nov 17, 2025 at 4:27 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > >
> > > > > This is a followup of commit e20dfbad8aab ("net: fix napi_consume_skb()
> > > > > with alien skbs").
> > > > >
> > > > > Now that the per-cpu napi_skb_cache is populated from the TX
> > > > > completion path, we can make use of this cache, especially for cpus
> > > > > not used by a driver NAPI poll (the primary user of the napi cache).
> > > > >
> > > > > We can use the napi_skb_cache only if the current context is not hard IRQ.
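> > > > >
> > > > > A rough sketch of the gate (simplified; the real code handles
> > > > > more cases):
> > > > >
> > > > >     struct sk_buff *skb;
> > > > >
> > > > >     if (!in_hardirq()) {
> > > > >         /* The per-cpu napi_alloc_cache is protected by BH disable. */
> > > > >         local_bh_disable();
> > > > >         skb = napi_skb_cache_get();
> > > > >         local_bh_enable();
> > > > >     } else {
> > > > >         /* Hard IRQ context: go to the slab allocator directly. */
> > > > >         skb = kmem_cache_alloc(net_hotdata.skbuff_cache,
> > > > >                                GFP_ATOMIC | __GFP_NOWARN);
> > > > >     }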
> > > > >
> > > > > With this patch, I consistently reach 130 Mpps on my UDP tx stress test
> > > > > and reduce SLUB spinlock contention to smaller values.
> > > > >
> > > > > Note there is still some SLUB contention for skb->head allocations.
> > > > >
> > > > > I had to tune /sys/kernel/slab/skbuff_small_head/cpu_partial
> > > > > and /sys/kernel/slab/skbuff_small_head/min_partial depending
> > > > > on the platform topology.
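> > > > >
> > > > > (For example, something like
> > > > > "echo 120 > /sys/kernel/slab/skbuff_small_head/cpu_partial";
> > > > > the value 120 is purely illustrative, the best settings depend
> > > > > on the machine.)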
> > > > >
> > > > > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > > >
> > > > Reviewed-by: Jason Xing <kerneljasonxing@...il.com>
> > > >
> > > > Thanks for working on this. I was thinking about this area as
> > > > well, since it affects the hot path for xsk (please see
> > > > __xsk_generic_xmit()->xsk_build_skb()->sock_alloc_send_pskb()). But I
> > > > wasn't aware of the trade-off between disabling IRQs and allocating
> > > > memory. AFAIK, I once removed an enabling/disabling IRQ pair and saw a
> > > > minor improvement, as this commit[1] says. Would you share your
> > > > invaluable experience with us in this case?
> > > >
> > > > In the meantime, I will do more rounds of experiments to see how they perform.
> > >
> > > Tested-by: Jason Xing <kerneljasonxing@...il.com>
> > > Done! I managed to see an improvement: the pps number went from
> > > 1,458,644 to 1,647,235 when running [2].
> > >
> > > But the sad news is that the previous commit [3] leads to a big
> > > drop for af_xdp, from 1,980,000 to 1,458,644. With commit [3]
> > > applied, I observed that xdpsock always allocated the skb on cpu
> > > 0 but the napi poll triggered skb_attempt_defer_free() on another
> > > cpu[4], which affected the final results.
> > >
> > > [2]
> > > taskset -c 0 ./xdpsock -i enp2s0f1 -q 1 -t -S -s 64
> > >
> > > [3]
> > > commit e20dfbad8aab2b7c72571ae3c3e2e646d6b04cb7
> > > Author: Eric Dumazet <edumazet@...gle.com>
> > > Date: Thu Nov 6 20:29:34 2025 +0000
> > >
> > > net: fix napi_consume_skb() with alien skbs
> > >
> > > There is a lack of NUMA awareness and, more generally, a lack
> > > of slab cache affinity on the TX completion path.
> > >
> > > [4]
> > > @c[
> > > skb_attempt_defer_free+1
> > > ixgbe_clean_tx_irq+723
> > > ixgbe_poll+119
> > > __napi_poll+48
> > > , ksoftirqd/24]: 1964731
> > >
> > > @c[
> > > kick_defer_list_purge+1
> > > napi_consume_skb+333
> > > ixgbe_clean_tx_irq+723
> > > ixgbe_poll+119
> > > , 34, swapper/34]: 123779
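> > >
> > > For context, my (simplified) understanding of the path in those
> > > stacks: skb_attempt_defer_free() hands an "alien" skb back to the
> > > cpu that allocated it, and kick_defer_list_purge() may send an IPI
> > > to get the remote list flushed. Roughly (the helper at the end is
> > > a made-up name; see net/core/skbuff.c for the real code):
> > >
> > >     void skb_attempt_defer_free(struct sk_buff *skb)
> > >     {
> > >         int cpu = skb->alloc_cpu;
> > >
> > >         if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
> > >             __kfree_skb(skb);       /* local free, no IPI */
> > >             return;
> > >         }
> > >         /* Queue the skb on the allocating cpu's defer list and,
> > >          * if needed, IPI that cpu so it flushes the list from
> > >          * softirq context - this is where the IPIs come from.
> > >          */
> > >         defer_to_alloc_cpu(skb, cpu);   /* hypothetical helper */
> > >     }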
> > >
> > > Thanks,
> > > Jason
> >
> > Hi Jason.
> >
> > It is a bit hard to guess without more details (which cpu you are
> > using), and perhaps some perf profiles.
>
> Xdpsock only calculates the speed on the cpu where it sends packets.
> In more detail, it checks whether the packets have been sent by
> inspecting the completion queue, and then sends another batch of
> packets, over and over again.
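> Roughly like this (a sketch built on the libxdp xsk ring helpers;
> xdpsock's actual loop differs, and the umem/xsk struct layout here
> is assumed):
>
>     unsigned int idx, done;
>
>     for (;;) {
>         /* Reap completed TX descriptors from the completion queue. */
>         done = xsk_ring_cons__peek(&umem->cq, BATCH_SIZE, &idx);
>         if (done)
>             xsk_ring_cons__release(&umem->cq, done);
>
>         /* Post another batch of descriptors and kick the kernel. */
>         if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) == BATCH_SIZE) {
>             /* fill the BATCH_SIZE tx descriptors at idx here */
>             xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
>             sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
>         }
>     }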
>
> My test env is relatively old:
> [root@...alhost ~]# lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 46 bits physical, 48 bits virtual
> Byte Order: Little Endian
> CPU(s): 48
> On-line CPU(s) list: 0-47
> Vendor ID: GenuineIntel
> BIOS Vendor ID: Intel(R) Corporation
> Model name: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
> BIOS Model name: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz CPU @ 2.3GHz
> BIOS CPU family: 179
> CPU family: 6
> Model: 63
> Thread(s) per core: 2
> Core(s) per socket: 12
> Socket(s): 2
> Stepping: 2
> CPU(s) scaling MHz: 84%
> CPU max MHz: 3100.0000
> CPU min MHz: 1200.0000
>
> After that commit [3], the perf profile differs because interrupts
> frequently jump into the tx path:
> - 98.72% 0.09% xdpsock libc.so.6 [.] __libc_sendto
>    - __libc_sendto
>       - 98.28% entry_SYSCALL_64_after_hwframe
>          - 98.19% do_syscall_64
>             - 97.94% x64_sys_call
>                - 97.91% __x64_sys_sendto
>                   - 97.80% __sys_sendto
>                      - 97.28% xsk_sendmsg
>                         - 97.18% __xsk_sendmsg.constprop.0.isra.0
>                            - 87.85% __xsk_generic_xmit
>                               - 41.71% xsk_build_skb
>                                  - 33.06% sock_alloc_send_pskb
>                                     + 22.06% alloc_skb_with_frags
>                                     + 7.61% skb_set_owner_w
>                                     + 0.58% asm_sysvec_call_function_single
>                                  1.66% skb_store_bits
>                                  1.60% memcpy_orig
>                                  0.69% xp_raw_get_data
>                               - 32.56% __dev_direct_xmit
>                                  + 15.62% 0xffffffffa064e5ac
>                                  - 7.33% __local_bh_enable_ip
>                                     + 6.18% do_softirq
>                                     + 0.70% asm_sysvec_call_function_single
>                                  + 5.78% validate_xmit_skb
>                               - 6.08% asm_sysvec_call_function_single
>                                  + sysvec_call_function_single
>                               + 1.62% xp_raw_get_data
>                            - 8.29% _raw_spin_lock
>                               + 3.01% asm_sysvec_call_function_single
>
> Prior to that patch, I didn't see any interrupts coming in, so I
> assume the defer part caused the problem.
This is a trade-off.
If you are using a single cpu to send packets, then it will not really
have contention on SLUB structures.
A bit like RFS: if you want to reach max throughput on a single flow,
it is probably better _not_ to use RFS/RPS, so that more than one cpu
can be involved in the rx work.
We can add a static key to enable/disable the behaviors that are most
suited to a particular workload.
We could also call skb_defer_free_flush() (now that it is IRQ safe)
from napi_skb_cache_get() before we attempt to allocate new sk_buffs.
This would prevent IPIs from being sent.
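Something like this (untested sketch; skb_defer_free_flush() would need
to be visible from net/core/skbuff.c, and the refill details follow my
recollection of napi_skb_cache_get()):

    static struct sk_buff *napi_skb_cache_get(void)
    {
        struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

        if (unlikely(!nc->skb_count)) {
            /* Freeing skbs that remote cpus deferred to us refills
             * this per-cpu cache (via napi_consume_skb()) and makes
             * the flush IPI unnecessary.
             */
            skb_defer_free_flush(this_cpu_ptr(&softnet_data));
        }
        if (unlikely(!nc->skb_count)) {
            nc->skb_count = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
                                                  GFP_ATOMIC | __GFP_NOWARN,
                                                  NAPI_SKB_CACHE_BULK,
                                                  nc->skb_cache);
            if (unlikely(!nc->skb_count))
                return NULL;
        }
        return nc->skb_cache[--nc->skb_count];
    }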
>
> > In particular, which cpu is the bottleneck?
> >
> > 1) There is still the missing part: tuning NAPI_SKB_CACHE_SIZE /
> > NAPI_SKB_CACHE_BULK; I was hoping you could send the patch we
> > discussed earlier?
>
> Sure thing :) I've done that part locally; I'm thinking I will post it
> once the current series gets merged?
>
> >
> > 2) I am also working on allowing batches of skbs for skb_attempt_defer_free().
> >
> > Another item I am working on is to let the qdisc be serviced
> > preferably by a cpu other than the one performing TX completion.
> > I mentioned making qdisc->running a sequence that we can latch
> > in __netif_schedule().
> > (The idea is to avoid spinning on the qdisc spinlock from
> > net_tx_action() if another cpu was able to call qdisc_run().)
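> > Purely illustrative (run_seq/sched_seq are invented names), the
> > latch could look like:
> >
> >     void __netif_schedule(struct Qdisc *q)
> >     {
> >         if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
> >             /* Remember how many times the qdisc has run so far. */
> >             q->sched_seq = READ_ONCE(q->run_seq);
> >             __netif_reschedule(q);
> >         }
> >     }
> >
> >     /* In net_tx_action(), before contending on the qdisc lock: */
> >     if (READ_ONCE(q->run_seq) != q->sched_seq)
> >         continue; /* another cpu ran qdisc_run() in the meantime */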
>
> Awesome work. Hope to see it very soon :)
>
> Thanks,
> Jason