Message-Id: <20251106155008.879042-1-nhudson@akamai.com>
Date: Thu,  6 Nov 2025 15:50:07 +0000
From: Nick Hudson <nhudson@...mai.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Jason Wang <jasowang@...hat.com>, Andrew Lunn <andrew+netdev@...n.ch>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>
Cc: Nick Hudson <nhudson@...mai.com>, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: [PATCH] tun: use skb_attempt_defer_free in tun_do_read

On a 640-CPU system running virtio-net VMs with the vhost-net driver and
multiqueue (64 queues) tap devices, testing has shown contention on the
page allocator's zone lock.

A 'perf record -F99 -g sleep 5' on the CPUs where the vhost worker threads run shows:

    # perf report -i perf.data.vhost --stdio --sort overhead  --no-children | head -22
    ...
    #
       100.00%
                |
                |--9.47%--queued_spin_lock_slowpath
                |          |
                |           --9.37%--_raw_spin_lock_irqsave
                |                     |
                |                     |--5.00%--__rmqueue_pcplist
                |                     |          get_page_from_freelist
                |                     |          __alloc_pages_noprof
                |                     |          |
                |                     |          |--3.34%--napi_alloc_skb
    #

That is, for Rx packets:
- ksoftirqd threads, pinned 1:1 to CPUs, do the SKB allocation.
- vhost-net threads, floating across CPUs, do the SKB free.

One way to avoid this contention is to free SKBs on the same CPU they were
allocated on. Freed pages can then be placed on the per-CPU page (PCP)
lists, so new allocations are served directly from the PCP list rather
than requesting new pages from the page allocator (and taking the zone
lock).

Fortunately, previous work has provided all the infrastructure to do this
via skb_attempt_defer_free(), which this change uses instead of
consume_skb() in tun_do_read().

Testing was done with a 6.12-based kernel and the patch ported forward.

Server: dual-socket AMD SP5, 2x AMD 9845 (Turin), with 2 VMs
Load generator: iPerf2 with 1200 clients, MSS=400

Before:
Maximum traffic rate: 55Gbps

After:
Maximum traffic rate: 110Gbps
---
 drivers/net/tun.c | 2 +-
 net/core/skbuff.c | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8192740357a0..388f3ffc6657 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 		if (unlikely(ret < 0))
 			kfree_skb(skb);
 		else
-			consume_skb(skb);
+			skb_attempt_defer_free(skb);
 	}
 
 	return ret;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6be01454f262..89217c43c639 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7201,6 +7201,7 @@ nodefer:	kfree_skb_napi_cache(skb);
 	DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
 	DEBUG_NET_WARN_ON_ONCE(skb->destructor);
 	DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
+	DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
 
 	sdn = per_cpu_ptr(net_hotdata.skb_defer_nodes, cpu) + numa_node_id();
 
@@ -7221,6 +7222,7 @@ nodefer:	kfree_skb_napi_cache(skb);
 	if (unlikely(kick))
 		kick_defer_list_purge(cpu);
 }
+EXPORT_SYMBOL(skb_attempt_defer_free);
 
 static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
 				 size_t offset, size_t len)
-- 
2.34.1

