Message-ID: <CANn89iJBgaoVuQL7jgKwRJd8drpXYTLdGrJUpP9KVrzsPGK_Zw@mail.gmail.com>
Date: Wed, 12 Nov 2025 06:08:04 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Jon Hunter <jonathanh@...dia.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, 
	Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org, 
	eric.dumazet@...il.com, 
	"linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs

On Wed, Nov 12, 2025 at 6:03 AM Jon Hunter <jonathanh@...dia.com> wrote:
>
> Hi Eric,
>
> On 06/11/2025 20:29, Eric Dumazet wrote:
> > There is a lack of NUMA awareness and more generally lack
> > of slab caches affinity on TX completion path.
> >
> > Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> > in per-cpu caches so that they can be recycled in RX path.
> >
> > Only use this if the skb was allocated on the same cpu,
> > otherwise use skb_attempt_defer_free() so that the skb
> > is freed on the original cpu.
> >
> > This removes contention on SLUB spinlocks and data structures.
> >
> > After this patch, I get a ~50% improvement for a UDP tx workload
> > on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
> >
> > 80 Mpps -> 120 Mpps.
> >
> > Profiling one of the 32 cpus servicing NIC interrupts :
> >
> > Before:
> >
> > mpstat -P 511 1 1
> >
> > Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> > Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
> >
> >      31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
> >      12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
> >       5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
> >       3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
> >       3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
> >       2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
> >       2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
> >       2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
> >       2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
> >       2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
> >       2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
> >       2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
> >       2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
> >       1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
> >       1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
> >       1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
> >       1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
> >       1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
> >       1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
> >       1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
> >       0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
> >       0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
> >       0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
> >       0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
> >       0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
> >       0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
> >       0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
> >       0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
> >       0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
> >       0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
> >       0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
> >       0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree
> >
> > After:
> >
> > mpstat -P 511 1 1
> >
> > Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> > Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
> >
> >      19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
> >      13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
> >      10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
> >      10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
> >       7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
> >       6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
> >       5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
> >       3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
> >       3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
> >       2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
> >       2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
> >       1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
> >       1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
> >       0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
> >       0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
> >       0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
> >       0.53%  swapper          [kernel.kallsyms]  [k] io_idle
> >       0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
> >       0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
> >       0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
> >       0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
> >       0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
> >       0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
> >       0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
> >       0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
> >       0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
> >       0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
> >       0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
> >       0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
> >       0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
> >       0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run
> >
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > ---
> >   net/core/skbuff.c | 5 +++++
> >   1 file changed, 5 insertions(+)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
> >
> >       DEBUG_NET_WARN_ON_ONCE(!in_softirq());
> >
> > +     if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> > +             skb_release_head_state(skb);
> > +             return skb_attempt_defer_free(skb);
> > +     }
> > +
> >       if (!skb_unref(skb))
> >               return;
> >
>
> I have noticed a suspend regression on one of our Tegra boards. Bisect
> is pointing to this commit and reverting this on top of -next fixes the
> issue.
>
> Out of all the Tegra boards we test only one is failing and that is the
> tegra124-jetson-tk1. This board uses the realtek r8169 driver ...
>
>   r8169 0000:01:00.0: enabling device (0140 -> 0143)
>   r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
>   r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
>
> I don't see any particular crash or error, and even after resuming from
> suspend the link does come up ...
>
>   r8169 0000:01:00.0 enp1s0: Link is Down
>   tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
>   OOM killer enabled.
>   Restarting tasks: Starting
>   Restarting tasks: Done
>   random: crng reseeded on system resumption
>   PM: suspend exit
>   ata1: SATA link down (SStatus 0 SControl 300)
>   r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx
>
> However, the board does not seem to resume fully. One thing I should
> point out is that for testing we always use an NFS rootfs. So this
> would indicate that the link comes up but networking is still having
> issues.
>
> Any thoughts?
>
> Jon

Perhaps try this patch: https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/

Thanks!
