Message-ID: <CANn89iJBgaoVuQL7jgKwRJd8drpXYTLdGrJUpP9KVrzsPGK_Zw@mail.gmail.com>
Date: Wed, 12 Nov 2025 06:08:04 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Jon Hunter <jonathanh@...dia.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
    Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
    Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
    netdev@...r.kernel.org, eric.dumazet@...il.com,
    "linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
On Wed, Nov 12, 2025 at 6:03 AM Jon Hunter <jonathanh@...dia.com> wrote:
>
> Hi Eric,
>
> On 06/11/2025 20:29, Eric Dumazet wrote:
> > There is a lack of NUMA awareness and more generally lack
> > of slab caches affinity on TX completion path.
> >
> > Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> > in per-cpu caches so that they can be recycled in RX path.
> >
> > Only use this if the skb was allocated on the same cpu,
> > otherwise use skb_attempt_defer_free() so that the skb
> > is freed on the original cpu.
> >
> > This removes contention on SLUB spinlocks and data structures.
> >
> > After this patch, I get ~50% improvement for a UDP tx workload
> > on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
> >
> > 80 Mpps -> 120 Mpps.
> >
> > Profiling one of the 32 cpus servicing NIC interrupts :
> >
> > Before:
> >
> > mpstat -P 511 1 1
> >
> > Average:  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
> > Average:  511   0.00   0.00   0.00    0.00   0.00  98.00   0.00   0.00   0.00   2.00
> >
> > 31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > 12.45% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
> > 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> > 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> > 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
> > 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
> > 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
> > 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
> > 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
> > 2.11% swapper [kernel.kallsyms] [k] __slab_free
> > 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
> > 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> > 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
> > 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
> > 1.34% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> > 1.23% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> > 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
> > 1.11% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> > 1.03% swapper [kernel.kallsyms] [k] fq_dequeue
> > 0.94% swapper [kernel.kallsyms] [k] kmem_cache_free
> > 0.93% swapper [kernel.kallsyms] [k] read_tsc
> > 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
> > 0.79% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> > 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
> > 0.76% swapper [kernel.kallsyms] [k] idpf_features_check
> > 0.72% swapper [kernel.kallsyms] [k] skb_release_data
> > 0.69% swapper [kernel.kallsyms] [k] build_detached_freelist
> > 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
> > 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
> > 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
> > 0.48% swapper [kernel.kallsyms] [k] sock_wfree
> >
> > After:
> >
> > mpstat -P 511 1 1
> >
> > Average:  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
> > Average:  511   0.00   0.00   0.00    0.00   0.00  51.49   0.00   0.00   0.00  48.51
> >
> > 19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
> > 13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
> > 10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
> > 10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
> > 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> > 6.69% swapper [kernel.kallsyms] [k] sock_wfree
> > 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
> > 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
> > 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
> > 2.73% swapper [kernel.kallsyms] [k] read_tsc
> > 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
> > 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
> > 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
> > 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
> > 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
> > 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
> > 0.53% swapper [kernel.kallsyms] [k] io_idle
> > 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
> > 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
> > 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
> > 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
> > 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
> > 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
> > 0.32% swapper [kernel.kallsyms] [k] net_rx_action
> > 0.32% swapper [kernel.kallsyms] [k] dql_completed
> > 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
> > 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
> > 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
> > 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
> > 0.28% swapper [kernel.kallsyms] [k] ktime_get
> > 0.24% swapper [kernel.kallsyms] [k] __qdisc_run
> >
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > ---
> > net/core/skbuff.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
> >
> >         DEBUG_NET_WARN_ON_ONCE(!in_softirq());
> >
> > +       if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> > +               skb_release_head_state(skb);
> > +               return skb_attempt_defer_free(skb);
> > +       }
> > +
> >         if (!skb_unref(skb))
> >                 return;
> >
>
> I have noticed a suspend regression on one of our Tegra boards. Bisect
> is pointing to this commit and reverting this on top of -next fixes the
> issue.
>
> Out of all the Tegra boards we test only one is failing and that is the
> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
>
> r8169 0000:01:00.0: enabling device (0140 -> 0143)
> r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
> r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
>
> I don't see any particular crash or error, and even after resuming from
> suspend the link does come up ...
>
> r8169 0000:01:00.0 enp1s0: Link is Down
> tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
> OOM killer enabled.
> Restarting tasks: Starting
> Restarting tasks: Done
> random: crng reseeded on system resumption
> PM: suspend exit
> ata1: SATA link down (SStatus 0 SControl 300)
> r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
>
> However, the board does not seem to resume fully. One thing I should
> point out is that for testing we always use an NFS rootfs. So this
> would indicate that the link comes up but networking is still having
> issues.
>
> Any thoughts?
>
> Jon

Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/

Thanks !
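
For readers following the thread, the condition that the quoted hunk adds to napi_consume_skb() can be modeled in user space roughly as below. This is only an illustrative sketch, not kernel code: struct model_skb, enum free_path, model_skb_shared() and model_choose_free_path() are hypothetical names invented here, and the model only reports which path would be taken. It does not free anything and does not try to reproduce the suspend regression discussed above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two sk_buff properties the new check reads. */
struct model_skb {
	int alloc_cpu;	/* CPU that allocated the buffer */
	int users;	/* reference count; anything other than 1 means shared */
};

/* Which branch of the patched function the skb would take (names are assumptions). */
enum free_path {
	FREE_PATH_DEFER_TO_ALLOC_CPU,	/* models the new skb_attempt_defer_free() branch */
	FREE_PATH_EXISTING,		/* models falling through to the pre-patch code */
};

/* Models skb_shared(): true when more than one reference is held. */
static bool model_skb_shared(const struct model_skb *skb)
{
	return skb->users != 1;
}

/*
 * Mirrors the condition in the quoted hunk: an unshared skb that was
 * allocated on another CPU is handed back to that CPU; everything else
 * continues down the existing path.
 */
static enum free_path model_choose_free_path(const struct model_skb *skb,
					     int this_cpu)
{
	if (skb->alloc_cpu != this_cpu && !model_skb_shared(skb))
		return FREE_PATH_DEFER_TO_ALLOC_CPU;
	return FREE_PATH_EXISTING;
}
```

In this model, an skb allocated on CPU 3 and completed on CPU 511 takes the deferred branch only while it holds a single reference; a shared skb, or one completed on its allocating CPU, falls through to the existing code.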