Message-ID: <CAL+tcoBeEXmyugUrqxct9VuYrSErVDA61nZ1Y62w8-NSwgdxjw@mail.gmail.com>
Date: Mon, 17 Nov 2025 17:16:26 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com,
Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Subject: Re: [PATCH v3 net-next 3/3] net: use napi_skb_cache even in process context
On Mon, Nov 17, 2025 at 4:57 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Mon, Nov 17, 2025 at 12:41 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > On Mon, Nov 17, 2025 at 9:07 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > >
> > > On Mon, Nov 17, 2025 at 4:27 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > > >
> > > > This is a followup of commit e20dfbad8aab ("net: fix napi_consume_skb()
> > > > with alien skbs").
> > > >
> > > > Now the per-cpu napi_skb_cache is populated from TX completion path,
> > > > we can make use of this cache, especially for cpus not used
> > > > from a driver NAPI poll (primary user of napi_cache).
> > > >
> > > > We can use the napi_skb_cache only if current context is not from hard irq.
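[ For anyone following the mechanism: a minimal sketch of the idea,
assuming a hypothetical wrapper around the existing napi_skb_cache_get()
helper; the actual patch may gate and structure this differently. ]

    /* Minimal sketch, not the literal patch: pull an skb from the
     * per-cpu napi_skb_cache whenever we are not in hard irq context.
     * The cache is refilled from the TX completion path and is only
     * touched under local_bh_disable(), so no irq disabling is needed.
     */
    static struct sk_buff *skb_from_napi_cache(gfp_t gfp)
    {
            struct sk_buff *skb = NULL;

            if (!in_hardirq()) {
                    local_bh_disable();
                    skb = napi_skb_cache_get();     /* existing helper */
                    local_bh_enable();
            }
            if (!skb)                               /* fall back to slab */
                    skb = kmem_cache_alloc(net_hotdata.skbuff_cache, gfp);
            return skb;
    }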
> > > >
> > > > With this patch, I consistently reach 130 Mpps on my UDP tx stress test
> > > > and reduce SLUB spinlock contention to smaller values.
> > > >
> > > > Note there is still some SLUB contention for skb->head allocations.
> > > >
> > > > I had to tune /sys/kernel/slab/skbuff_small_head/cpu_partial
> > > > and /sys/kernel/slab/skbuff_small_head/min_partial depending
> > > > on the platform taxonomy.
> > > >
> > > > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > >
> > > Reviewed-by: Jason Xing <kerneljasonxing@...il.com>
> > >
> > > Thanks for working on this. I had been thinking about this as well,
> > > since it affects the hot path for xsk (please see
> > > __xsk_generic_xmit()->xsk_build_skb()->sock_alloc_send_pskb()). But I
> > > wasn't aware of how much avoiding the irq disable/enable around the
> > > allocation buys us. AFAIK, I once removed an irq enable/disable pair
> > > and saw a minor improvement, as this commit[1] says. Would you share
> > > your experience with us in this case?
> > >
> > > In the meantime, I will do more rounds of experiments to see how they perform.
> >
> > Tested-by: Jason Xing <kerneljasonxing@...il.com>
> > Done! I managed to see an improvement: the pps number goes from
> > 1,458,644 to 1,647,235 when running [2].
> >
> > Sadly, though, the previous commit [3] leads to a huge decrease for
> > af_xdp, from 1,980,000 to 1,458,644 pps. With commit [3] applied, I
> > observed that xdpsock always allocated the skb on cpu 0 while the
> > napi poll triggered skb_attempt_defer_free() on another cpu[4], which
> > affected the final results.
> >
> > [2]
> > taskset -c 0 ./xdpsock -i enp2s0f1 -q 1 -t -S -s 64
> >
> > [3]
> > commit e20dfbad8aab2b7c72571ae3c3e2e646d6b04cb7
> > Author: Eric Dumazet <edumazet@...gle.com>
> > Date: Thu Nov 6 20:29:34 2025 +0000
> >
> > net: fix napi_consume_skb() with alien skbs
> >
> > There is a lack of NUMA awareness and more generally lack
> > of slab caches affinity on TX completion path.
> >
> > [4]
> > @c[
> > skb_attempt_defer_free+1
> > ixgbe_clean_tx_irq+723
> > ixgbe_poll+119
> > __napi_poll+48
> > , ksoftirqd/24]: 1964731
> >
> > @c[
> > kick_defer_list_purge+1
> > napi_consume_skb+333
> > ixgbe_clean_tx_irq+723
> > ixgbe_poll+119
> > , 34, swapper/34]: 123779
> >
> > Thanks,
> > Jason
>
> Hi Jason.
>
> It is a bit hard to guess without more details (which cpu you are
> using), and perhaps perf profiles.
Xdpsock only calculates the speed on the cpu where it sends packets.
In more detail, it checks whether packets have been sent by inspecting
the completion queue, then sends another batch of packets, over and
over again.
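Roughly, in C (a simplified sketch of the copy-mode TX loop, assuming
an xsk_socket_info struct that bundles the tx and completion rings as
in the sample, plus a hypothetical frame_addr() helper; not the literal
xdpsock code):

    #include <xdp/xsk.h>        /* libxdp ring helpers */
    #include <sys/socket.h>

    /* Simplified sketch of the xdpsock copy-mode ("-S") TX loop. Each
     * sendto() kick goes through the xsk_build_skb() /
     * sock_alloc_send_pskb() path mentioned above, and pps is
     * accounted only on this (sending) cpu.
     */
    static void tx_loop(struct xsk_socket_info *xsk)    /* assumed struct */
    {
            __u32 idx, idx_cq;

            for (;;) {
                    if (xsk_ring_prod__reserve(&xsk->tx, BATCH, &idx) == BATCH) {
                            for (__u32 i = 0; i < BATCH; i++) {
                                    struct xdp_desc *d;

                                    d = xsk_ring_prod__tx_desc(&xsk->tx, idx + i);
                                    d->addr = frame_addr(i);    /* hypothetical umem helper */
                                    d->len  = 64;               /* -s 64 */
                            }
                            xsk_ring_prod__submit(&xsk->tx, BATCH);
                    }
                    /* kick the kernel: copy mode transmits via sendto() */
                    sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);

                    /* reap completions before queueing the next batch */
                    unsigned int rcvd = xsk_ring_cons__peek(&xsk->cq, BATCH, &idx_cq);
                    if (rcvd)
                            xsk_ring_cons__release(&xsk->cq, rcvd);
            }
    }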
My test env is relatively old:
[root@...alhost ~]# lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   48
  On-line CPU(s) list:    0-47
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel(R) Corporation
  Model name:             Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
    BIOS Model name:      Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz CPU @ 2.3GHz
    BIOS CPU family:      179
    CPU family:           6
    Model:                63
    Thread(s) per core:   2
    Core(s) per socket:   12
    Socket(s):            2
    Stepping:             2
    CPU(s) scaling MHz:   84%
    CPU max MHz:          3100.0000
    CPU min MHz:          1200.0000
After that commit [3], the perf profile differs because interrupts
frequently jump in on the tx path:
-   98.72%  0.09%  xdpsock  libc.so.6  [.] __libc_sendto
   - __libc_sendto
      - 98.28% entry_SYSCALL_64_after_hwframe
         - 98.19% do_syscall_64
            - 97.94% x64_sys_call
               - 97.91% __x64_sys_sendto
                  - 97.80% __sys_sendto
                     - 97.28% xsk_sendmsg
                        - 97.18% __xsk_sendmsg.constprop.0.isra.0
                           - 87.85% __xsk_generic_xmit
                              - 41.71% xsk_build_skb
                                 - 33.06% sock_alloc_send_pskb
                                    + 22.06% alloc_skb_with_frags
                                    + 7.61% skb_set_owner_w
                                    + 0.58% asm_sysvec_call_function_single
                                   1.66% skb_store_bits
                                   1.60% memcpy_orig
                                   0.69% xp_raw_get_data
                              - 32.56% __dev_direct_xmit
                                 + 15.62% 0xffffffffa064e5ac
                                 - 7.33% __local_bh_enable_ip
                                    + 6.18% do_softirq
                                    + 0.70% asm_sysvec_call_function_single
                                 + 5.78% validate_xmit_skb
                              - 6.08% asm_sysvec_call_function_single
                                 + sysvec_call_function_single
                              + 1.62% xp_raw_get_data
                           - 8.29% _raw_spin_lock
                              + 3.01% asm_sysvec_call_function_single
Prior to that patch, I didn't see any interrupts coming in, so I
assume the defer part caused the problem.
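That is consistent with what skb_attempt_defer_free() does; roughly
(heavily simplified, with the queueing details elided, not verbatim
kernel code):

    /* Rough shape of skb_attempt_defer_free(), heavily simplified: an
     * skb freed on a cpu other than skb->alloc_cpu is queued onto the
     * allocating cpu's defer list, and that cpu may be kicked with an
     * IPI, which is where the asm_sysvec_call_function_single hits in
     * the profile above come from.
     */
    void skb_attempt_defer_free(struct sk_buff *skb)
    {
            int cpu = skb->alloc_cpu;
            struct softnet_data *sd;

            if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
                    __kfree_skb(skb);               /* free locally */
                    return;
            }
            sd = &per_cpu(softnet_data, cpu);
            /* ... queue skb on cpu's defer list under its lock ... */
            kick_defer_list_purge(sd, cpu);         /* IPI, see stack [4] */
    }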
> In particular which cpu is the bottleneck ?
>
> 1) There is still the missing part about tuning NAPI_SKB_CACHE_SIZE /
> NAPI_SKB_CACHE_BULK; I was hoping you could send the patch we
> discussed earlier?
Sure thing :) I've done that part locally; I'm thinking I will post it
once the current series gets merged?
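For reference, the knobs in question are compile-time constants in
net/core/skbuff.c (values as of recent kernels, worth re-checking
against net-next):

    /* net/core/skbuff.c (recent kernels; a tuning patch would
     * presumably enlarge these or make them adaptive):
     */
    #define NAPI_SKB_CACHE_SIZE     64
    #define NAPI_SKB_CACHE_BULK     16
    #define NAPI_SKB_CACHE_HALF     (NAPI_SKB_CACHE_SIZE / 2)

    struct napi_alloc_cache {
            local_lock_t bh_lock;
            struct page_frag_cache page;
            unsigned int skb_count;
            void *skb_cache[NAPI_SKB_CACHE_SIZE];
    };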
>
> 2) I am also working on allowing batches of skbs for skb_attempt_defer_free().
>
> Another item I am working on is to let the qdisc be serviced
> preferably by a cpu other than the one performing TX completion.
> I mentioned making qdisc->running a sequence that we can latch
> in __netif_schedule().
> (The idea is to be able to not spin on the qdisc spinlock from
> net_tx_action() if another cpu was able to call qdisc_run())
Awesome work. Hope to see it very soon :)
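If I've understood the qdisc->running idea correctly, it might look
roughly like this (a hypothetical sketch; the latched_seq/running_seq
fields and helper names are invented purely to illustrate):

    /* Hypothetical sketch only: __netif_schedule() latches the qdisc's
     * running sequence; when net_tx_action() later services the qdisc,
     * it skips taking the qdisc spinlock if another cpu has already
     * run the qdisc since the latch was taken.
     */
    static void netif_schedule_latch(struct Qdisc *q)
    {
            q->latched_seq = READ_ONCE(q->running_seq);  /* invented fields */
            __netif_schedule(q);
    }

    static void net_tx_action_one(struct Qdisc *q)
    {
            if (READ_ONCE(q->running_seq) != q->latched_seq)
                    return;                 /* someone else already ran it */

            spin_lock(qdisc_lock(q));
            WRITE_ONCE(q->running_seq, q->running_seq + 1);
            __qdisc_run(q);
            spin_unlock(qdisc_lock(q));
    }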
Thanks,
Jason