Message-ID: <7460a188-3a74-4336-ae03-c88e21ffc1ca@nvidia.com>
Date: Wed, 12 Nov 2025 14:03:30 +0000
From: Jon Hunter <jonathanh@...dia.com>
To: Eric Dumazet <edumazet@...gle.com>, "David S . Miller"
 <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>
Cc: Simon Horman <horms@...nel.org>, Kuniyuki Iwashima <kuniyu@...gle.com>,
 Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org,
 eric.dumazet@...il.com,
 "linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs

Hi Eric,

On 06/11/2025 20:29, Eric Dumazet wrote:
> There is a lack of NUMA awareness and more generally lack
> of slab caches affinity on TX completion path.
> 
> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
> in per-cpu caches so that they can be recycled in RX path.
> 
> Only use this if the skb was allocated on the same cpu,
> otherwise use skb_attempt_defer_free() so that the skb
> is freed on the original cpu.
> 
> This removes contention on SLUB spinlocks and data structures.
> 
> After this patch, I get ~50% improvement for an UDP tx workload
> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
> 
> 80 Mpps -> 120 Mpps.
> 
> Profiling one of the 32 cpus servicing NIC interrupts :
> 
> Before:
> 
> mpstat -P 511 1 1
> 
> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
> 
>      31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>       5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
>       3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>       3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>       2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
>       2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
>       2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
>       2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
>       2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
>       2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
>       2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
>       2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>       1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
>       1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
>       1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>       1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>       1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
>       1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>       1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>       0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
>       0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
>       0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
>       0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>       0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
>       0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>       0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
>       0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
>       0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
>       0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
>       0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
>       0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree
> 
> After:
> 
> mpstat -P 511 1 1
> 
> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
> 
>      19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>      13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>      10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
>      10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>       7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>       6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
>       5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
>       3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>       3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
>       2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
>       2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>       1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>       1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
>       0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
>       0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
>       0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
>       0.53%  swapper          [kernel.kallsyms]  [k] io_idle
>       0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
>       0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
>       0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
>       0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
>       0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
>       0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
>       0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
>       0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
>       0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
>       0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
>       0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
>       0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
>       0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
>       0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run
> 
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> ---
>   net/core/skbuff.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>   
>   	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>   
> +	if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
> +		skb_release_head_state(skb);
> +		return skb_attempt_defer_free(skb);
> +	}
> +
>   	if (!skb_unref(skb))
>   		return;
>   
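
For anyone following the thread, the check added in the hunk above can be
modelled in user space roughly as follows. This is only a sketch under
stated assumptions: struct model_skb, enum consume_path and
model_consume_path() are hypothetical names made up for illustration, not
kernel API, and the real napi_consume_skb() does more work on both paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only the two pieces of
 * state the new check looks at are modelled here. */
struct model_skb {
	int alloc_cpu;	/* CPU that allocated the buffer */
	int users;	/* reference count; anything other than 1 means shared */
};

/* Which way the buffer would go (names are illustrative only). */
enum consume_path {
	PATH_EXISTING,		/* fall through to the pre-existing code */
	PATH_DEFER_REMOTE,	/* hand the buffer back to the allocating CPU */
};

/* Mirrors the idea of skb_shared(): more than one holder of the buffer. */
static bool model_skb_shared(const struct model_skb *skb)
{
	return skb->users != 1;
}

/* Same condition as the patch: a buffer allocated on another CPU and not
 * shared is deferred; everything else keeps the old behaviour. */
static enum consume_path model_consume_path(const struct model_skb *skb,
					     int this_cpu)
{
	assert(skb != NULL);

	if (skb->alloc_cpu != this_cpu && !model_skb_shared(skb))
		return PATH_DEFER_REMOTE;

	return PATH_EXISTING;
}
```

In this model, a buffer with alloc_cpu 3 and a single reference takes the
deferred path when consumed on CPU 5, and the existing path when consumed
on CPU 3 or when a second reference is held.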

I have noticed a suspend regression on one of our Tegra boards. Bisect
points to this commit, and reverting it on top of -next fixes the
issue.

Of all the Tegra boards we test, only one is failing: the
tegra124-jetson-tk1. This board uses the Realtek r8169 driver ...

  r8169 0000:01:00.0: enabling device (0140 -> 0143)
  r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b2:0e, XID 4c0, IRQ 132
  r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]

I don't see any particular crash or error, and even after resuming from
suspend the link does come up ...

  r8169 0000:01:00.0 enp1s0: Link is Down
  tegra-xusb 70090000.usb: Firmware timestamp: 2014-09-16 02:10:07 UTC
  OOM killer enabled.
  Restarting tasks: Starting
  Restarting tasks: Done
  random: crng reseeded on system resumption
  PM: suspend exit
  ata1: SATA link down (SStatus 0 SControl 300)
  r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control rx/tx

However, the board does not seem to resume fully. Note that for
testing we always use an NFS rootfs, which would suggest that the
link comes up but networking is still not working properly.

Any thoughts?

Jon
  
-- 
nvpublic

