Message-ID: <CANn89iJec+7ssKpAp0Um5pNecfzxohRJBKQybSYS=e-9pQjqag@mail.gmail.com>
Date: Mon, 17 Nov 2025 01:48:06 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Jason Xing <kerneljasonxing@...il.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com,
Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Subject: Re: [PATCH v3 net-next 3/3] net: use napi_skb_cache even in process context
On Mon, Nov 17, 2025 at 1:17 AM Jason Xing <kerneljasonxing@...il.com> wrote:
>
> On Mon, Nov 17, 2025 at 4:57 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Mon, Nov 17, 2025 at 12:41 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > >
> > > On Mon, Nov 17, 2025 at 9:07 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > On Mon, Nov 17, 2025 at 4:27 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > >
> > > > > This is a followup of commit e20dfbad8aab ("net: fix napi_consume_skb()
> > > > > with alien skbs").
> > > > >
> > > > > Now that the per-cpu napi_skb_cache is populated from the TX
> > > > > completion path, we can make use of this cache, especially for cpus
> > > > > not used by a driver NAPI poll (the primary user of the napi cache).
> > > > >
> > > > > We can use the napi_skb_cache only if the current context is not hard IRQ.
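> > > > >
> > > > > A rough sketch of the gate (simplified; the real code handles
> > > > > more cases):
> > > > >
> > > > >     struct sk_buff *skb;
> > > > >
> > > > >     if (!in_hardirq()) {
> > > > >         /* The per-cpu napi_alloc_cache is protected by BH disable. */
> > > > >         local_bh_disable();
> > > > >         skb = napi_skb_cache_get();
> > > > >         local_bh_enable();
> > > > >     } else {
> > > > >         /* Hard IRQ context: go to the slab allocator directly. */
> > > > >         skb = kmem_cache_alloc(net_hotdata.skbuff_cache,
> > > > >                                GFP_ATOMIC | __GFP_NOWARN);
> > > > >     }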
> > > > >
> > > > > With this patch, I consistently reach 130 Mpps on my UDP tx stress test
> > > > > and reduce SLUB spinlock contention to smaller values.
> > > > >
> > > > > Note there is still some SLUB contention for skb->head allocations.
> > > > >
> > > > > I had to tune /sys/kernel/slab/skbuff_small_head/cpu_partial
> > > > > and /sys/kernel/slab/skbuff_small_head/min_partial depending
> > > > > on the platform topology.
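> > > > >
> > > > > (For example, something like
> > > > > "echo 120 > /sys/kernel/slab/skbuff_small_head/cpu_partial";
> > > > > the value 120 is purely illustrative, the best settings depend
> > > > > on the machine.)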
> > > > >
> > > > > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > > >
> > > > Reviewed-by: Jason Xing <kerneljasonxing@...il.com>
> > > >
> > > > Thanks for working on this. I was thinking about this area as
> > > > well, since it affects the hot path for xsk (please see
> > > > __xsk_generic_xmit()->xsk_build_skb()->sock_alloc_send_pskb()). But I
> > > > wasn't aware of the trade-off between disabling IRQs and allocating
> > > > memory. AFAIK, I once removed an enabling/disabling IRQ pair and saw a
> > > > minor improvement, as this commit[1] says. Would you share your
> > > > invaluable experience with us in this case?
> > > >
> > > > In the meantime, I will do more rounds of experiments to see how they perform.
> > >
> > > Tested-by: Jason Xing <kerneljasonxing@...il.com>
> > > Done! I managed to see an improvement: the pps number went from
> > > 1,458,644 to 1,647,235 when running [2].
> > >
> > > But the sad news is that the previous commit [3] leads to a big
> > > drop for af_xdp, from 1,980,000 to 1,458,644. With commit [3]
> > > applied, I observed that xdpsock always allocated the skb on cpu
> > > 0 but the napi poll triggered skb_attempt_defer_free() on another
> > > cpu[4], which affected the final results.
> > >
> > > [2]
> > > taskset -c 0 ./xdpsock -i enp2s0f1 -q 1 -t -S -s 64
> > >
> > > [3]
> > > commit e20dfbad8aab2b7c72571ae3c3e2e646d6b04cb7
> > > Author: Eric Dumazet <edumazet@...gle.com>
> > > Date: Thu Nov 6 20:29:34 2025 +0000
> > >
> > > net: fix napi_consume_skb() with alien skbs
> > >
> > > There is a lack of NUMA awareness and, more generally, a lack
> > > of slab cache affinity on the TX completion path.
> > >
> > > [4]
> > > @c[
> > > skb_attempt_defer_free+1
> > > ixgbe_clean_tx_irq+723
> > > ixgbe_poll+119
> > > __napi_poll+48
> > > , ksoftirqd/24]: 1964731
> > >
> > > @c[
> > > kick_defer_list_purge+1
> > > napi_consume_skb+333
> > > ixgbe_clean_tx_irq+723
> > > ixgbe_poll+119
> > > , 34, swapper/34]: 123779
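> > >
> > > For context, my (simplified) understanding of the path in those
> > > stacks: skb_attempt_defer_free() hands an "alien" skb back to the
> > > cpu that allocated it, and kick_defer_list_purge() may send an IPI
> > > to get the remote list flushed. Roughly (the helper at the end is
> > > a made-up name; see net/core/skbuff.c for the real code):
> > >
> > >     void skb_attempt_defer_free(struct sk_buff *skb)
> > >     {
> > >         int cpu = skb->alloc_cpu;
> > >
> > >         if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
> > >             __kfree_skb(skb);       /* local free, no IPI */
> > >             return;
> > >         }
> > >         /* Queue the skb on the allocating cpu's defer list and,
> > >          * if needed, IPI that cpu so it flushes the list from
> > >          * softirq context - this is where the IPIs come from.
> > >          */
> > >         defer_to_alloc_cpu(skb, cpu);   /* hypothetical helper */
> > >     }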
> > >
> > > Thanks,
> > > Jason
> >
> > Hi Jason.
> >
> > It is a bit hard to guess without more details (which cpu you are
> > using), and perhaps some perf profiles.
>
> Xdpsock only calculates the speed on the cpu where it sends packets.
> In more detail, it checks whether the packets have been sent by
> inspecting the completion queue, and then sends another batch of
> packets, over and over again.
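> Roughly like this (a sketch built on the libxdp xsk ring helpers;
> xdpsock's actual loop differs, and the umem/xsk struct layout here
> is assumed):
>
>     unsigned int idx, done;
>
>     for (;;) {
>         /* Reap completed TX descriptors from the completion queue. */
>         done = xsk_ring_cons__peek(&umem->cq, BATCH_SIZE, &idx);
>         if (done)
>             xsk_ring_cons__release(&umem->cq, done);
>
>         /* Post another batch of descriptors and kick the kernel. */
>         if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) == BATCH_SIZE) {
>             /* fill the BATCH_SIZE tx descriptors at idx here */
>             xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
>             sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
>         }
>     }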
>
> My test env is relatively old:
> [root@...alhost ~]# lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 46 bits physical, 48 bits virtual
> Byte Order: Little Endian
> CPU(s): 48
> On-line CPU(s) list: 0-47
> Vendor ID: GenuineIntel
> BIOS Vendor ID: Intel(R) Corporation
> Model name: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
> BIOS Model name: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz CPU @ 2.3GHz
> BIOS CPU family: 179
> CPU family: 6
> Model: 63
> Thread(s) per core: 2
> Core(s) per socket: 12
> Socket(s): 2
> Stepping: 2
> CPU(s) scaling MHz: 84%
> CPU max MHz: 3100.0000
> CPU min MHz: 1200.0000
>
> After that commit [3], the perf profile differs because interrupts
> frequently jump into the tx path:
> - 98.72% 0.09% xdpsock libc.so.6 [.] __libc_sendto
>    - __libc_sendto
>       - 98.28% entry_SYSCALL_64_after_hwframe
>          - 98.19% do_syscall_64
>             - 97.94% x64_sys_call
>                - 97.91% __x64_sys_sendto
>                   - 97.80% __sys_sendto
>                      - 97.28% xsk_sendmsg
>                         - 97.18% __xsk_sendmsg.constprop.0.isra.0
>                            - 87.85% __xsk_generic_xmit
>                               - 41.71% xsk_build_skb
>                                  - 33.06% sock_alloc_send_pskb
>                                     + 22.06% alloc_skb_with_frags
>                                     + 7.61% skb_set_owner_w
>                                     + 0.58% asm_sysvec_call_function_single
>                                  1.66% skb_store_bits
>                                  1.60% memcpy_orig
>                                  0.69% xp_raw_get_data
>                               - 32.56% __dev_direct_xmit
>                                  + 15.62% 0xffffffffa064e5ac
>                                  - 7.33% __local_bh_enable_ip
>                                     + 6.18% do_softirq
>                                     + 0.70% asm_sysvec_call_function_single
>                                  + 5.78% validate_xmit_skb
>                               - 6.08% asm_sysvec_call_function_single
>                                  + sysvec_call_function_single
>                               + 1.62% xp_raw_get_data
>                            - 8.29% _raw_spin_lock
>                               + 3.01% asm_sysvec_call_function_single
>
> Prior to that patch, I didn't see any interrupts coming in, so I
> assume the defer part caused the problem.
This is a trade-off.
If you are using a single cpu to send packets, then it will not really
have contention on SLUB structures.
A bit like RFS: if you want to reach max throughput on a single flow,
it is probably better _not_ to use RFS/RPS, so that more than one cpu
can be involved in the rx work.
We can add a static key to enable/disable the behaviors that are most
suited to a particular workload.
We could also call skb_defer_free_flush() (now that it is IRQ safe)
from napi_skb_cache_get() before we attempt to allocate new sk_buffs.
This would prevent IPIs from being sent.
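Something like this (untested sketch; skb_defer_free_flush() would need
to be visible from net/core/skbuff.c, and the refill details follow my
recollection of napi_skb_cache_get()):

    static struct sk_buff *napi_skb_cache_get(void)
    {
        struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

        if (unlikely(!nc->skb_count)) {
            /* Freeing skbs that remote cpus deferred to us refills
             * this per-cpu cache (via napi_consume_skb()) and makes
             * the flush IPI unnecessary.
             */
            skb_defer_free_flush(this_cpu_ptr(&softnet_data));
        }
        if (unlikely(!nc->skb_count)) {
            nc->skb_count = kmem_cache_alloc_bulk(net_hotdata.skbuff_cache,
                                                  GFP_ATOMIC | __GFP_NOWARN,
                                                  NAPI_SKB_CACHE_BULK,
                                                  nc->skb_cache);
            if (unlikely(!nc->skb_count))
                return NULL;
        }
        return nc->skb_cache[--nc->skb_count];
    }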
>
> > In particular, which cpu is the bottleneck?
> >
> > 1) There is still the missing part: tuning NAPI_SKB_CACHE_SIZE /
> > NAPI_SKB_CACHE_BULK; I was hoping you could send the patch we
> > discussed earlier?
>
> Sure thing :) I've done that part locally; I'm thinking I will post it
> once the current series gets merged?
>
> >
> > 2) I am also working on allowing batches of skbs for skb_attempt_defer_free().
> >
> > Another item I am working on is to let the qdisc be serviced
> > preferably by a cpu other than the one performing TX completion.
> > I mentioned making qdisc->running a sequence that we can latch
> > in __netif_schedule().
> > (The idea is to avoid spinning on the qdisc spinlock from
> > net_tx_action() if another cpu was able to call qdisc_run().)
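> > Purely illustrative (run_seq/sched_seq are invented names), the
> > latch could look like:
> >
> >     void __netif_schedule(struct Qdisc *q)
> >     {
> >         if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
> >             /* Remember how many times the qdisc has run so far. */
> >             q->sched_seq = READ_ONCE(q->run_seq);
> >             __netif_reschedule(q);
> >         }
> >     }
> >
> >     /* In net_tx_action(), before contending on the qdisc lock: */
> >     if (READ_ONCE(q->run_seq) != q->sched_seq)
> >         continue; /* another cpu ran qdisc_run() in the meantime */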
>
> Awesome work. Hope to see it very soon :)
>
> Thanks,
> Jason