Message-ID: <7460a188-3a74-4336-ae03-c88e21ffc1ca@nvidia.com>
Date: Wed, 12 Nov 2025 14:03:30 +0000
From: Jon Hunter <jonathanh@...dia.com>
To: Eric Dumazet <edumazet@...gle.com>, "David S . Miller"
<davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>
Cc: Simon Horman <horms@...nel.org>, Kuniyuki Iwashima <kuniyu@...gle.com>,
Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org,
eric.dumazet@...il.com,
"linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
Hi Eric,
On 06/11/2025 20:29, Eric Dumazet wrote:
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
>
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
>
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
>
> This removes contention on SLUB spinlocks and data structures.
>
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>
> 80 Mpps -> 120 Mpps.
>
> Profiling one of the 32 cpus servicing NIC interrupts :
>
> Before:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 2.00
>
> 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
> 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
> 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
> 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
> 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
> 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
> 2.11% swapper [kernel.kallsyms] [k] __slab_free
> 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
> 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
> 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
> 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
> 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.03% swapper [kernel.kallsyms] [k] fq_dequeue
> 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
> 0.93% swapper [kernel.kallsyms] [k] read_tsc
> 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
> 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
> 0.76% swapper [kernel.kallsyms] [k] idpf_features_check
> 0.72% swapper [kernel.kallsyms] [k] skb_release_data
> 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
> 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
> 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
> 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
> 0.48% swapper [kernel.kallsyms] [k] sock_wfree
>
> After:
>
> mpstat -P 511 1 1
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 511 0.00 0.00 0.00 0.00 0.00 51.49 0.00 0.00 0.00 48.51
>
> 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
> 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 6.69% swapper [kernel.kallsyms] [k] sock_wfree
> 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
> 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
> 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
> 2.73% swapper [kernel.kallsyms] [k] read_tsc
> 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
> 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
> 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
> 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
> 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
> 0.53% swapper [kernel.kallsyms] [k] io_idle
> 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
> 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
> 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> 0.32% swapper [kernel.kallsyms] [k] dql_completed
> 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> 0.28% swapper [kernel.kallsyms] [k] ktime_get
> 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> ---
> net/core/skbuff.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>
>  	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>  
> +	if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> +		skb_release_head_state(skb);
> +		return skb_attempt_defer_free(skb);
> +	}
> +
>  	if (!skb_unref(skb))
>  		return;
>
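
For anyone following the thread, here is how I read the new check, written
as a small user-space model rather than kernel code. Everything in it
(struct fake_skb, fake_skb_shared(), pick_free_path() and the enum) is a
made-up stand-in. Only the condition itself is taken from the hunk above,
and the "users != 1" test is my assumption about what skb_shared() checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two sk_buff fields the new check reads. */
struct fake_skb {
	int alloc_cpu;	/* CPU that allocated the buffer */
	int users;	/* reference count */
};

/* Which way a completed TX buffer goes in this model. */
enum free_path {
	FREE_PATH_EXISTING,	/* fall through to the pre-patch code */
	FREE_PATH_DEFER_REMOTE,	/* hand the buffer back to alloc_cpu */
};

/* Assumption: a buffer counts as shared when it has more than one user. */
static bool fake_skb_shared(const struct fake_skb *skb)
{
	return skb->users != 1;
}

/*
 * Mirrors the condition added by the patch: only an unshared buffer
 * that was allocated on a different CPU takes the deferred path.
 */
static enum free_path pick_free_path(const struct fake_skb *skb, int this_cpu)
{
	assert(skb != NULL);

	if (skb->alloc_cpu != this_cpu && !fake_skb_shared(skb))
		return FREE_PATH_DEFER_REMOTE;

	return FREE_PATH_EXISTING;
}
```

In this model an unshared buffer allocated on CPU 3 and completed on CPU 5
takes the deferred path, while the same buffer completed on CPU 3, or a
shared buffer completed on any CPU, stays on the existing path.
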
I have noticed a suspend regression on one of our Tegra boards. Bisect
points to this commit, and reverting it on top of -next fixes the
issue.

Of all the Tegra boards we test, only one is failing: the
tegra124-jetson-tk1. This board uses the Realtek r8169 driver ...
r8169 0000:01:00.0: enabling device (0140 -> 0143)
r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
I don't see any particular crash or error, and even after resuming from
suspend the link does come up ...
r8169 0000:01:00.0 enp1s0: Link is Down
tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
OOM killer enabled.
Restarting tasks: Starting
Restarting tasks: Done
random: crng reseeded on system resumption
PM: suspend exit
ata1: SATA link down (SStatus 0 SControl 300)
r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
However, the board does not seem to resume fully. I should point out
that for testing we always use an NFS rootfs, so this suggests that
the link comes up but networking is still having issues.
Any thoughts?
Jon
--
nvpublic