Message-ID: <CACGkMEt1xybppvu2W42qWfabbsvRdH=1iycoQBOxJ3-+frFW6Q@mail.gmail.com>
Date: Fri, 7 Nov 2025 10:21:53 +0800
From: Jason Wang <jasowang@...hat.com>
To: Nick Hudson <nhudson@...mai.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>, Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] tun: use skb_attempt_defer_free in tun_do_read
On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@...mai.com> wrote:
>
> On a 640-CPU system running virtio-net VMs with the vhost-net driver and
> multiqueue (64) tap devices, testing has shown contention on the zone lock
> of the page allocator.
>
> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows
>
> # perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
> ...
> #
> 100.00%
> |
> |--9.47%--queued_spin_lock_slowpath
> | |
> | --9.37%--_raw_spin_lock_irqsave
> | |
> | |--5.00%--__rmqueue_pcplist
> | | get_page_from_freelist
> | | __alloc_pages_noprof
> | | |
> | | |--3.34%--napi_alloc_skb
> #
>
> That is, for Rx packets:
> - ksoftirqd threads, pinned 1:1 to CPUs, do the SKB allocation.
> - vhost-net threads, floating across CPUs, do the SKB free.
>
> One method to avoid this contention is to free SKBs on the same CPU they
> were allocated on. Freed pages can then be placed on the per-cpu page (PCP)
> lists, so new allocations are taken directly from the PCP list rather than
> requesting fresh pages from the page allocator (and taking the zone lock).
>
> Fortunately, previous work has provided all the infrastructure to do this
> via the skb_attempt_defer_free call, which this change uses instead of
> consume_skb in tun_do_read.
>
> Testing was done with a 6.12-based kernel and the patch ported forward.
>
> Server is Dual Socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
> Load generator: iPerf2 x 1200 clients MSS=400
>
> Before:
> Maximum traffic rate: 55Gbps
>
> After:
> Maximum traffic rate: 110Gbps
> ---
> drivers/net/tun.c | 2 +-
> net/core/skbuff.c | 2 ++
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 8192740357a0..388f3ffc6657 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
> if (unlikely(ret < 0))
> kfree_skb(skb);
> else
> - consume_skb(skb);
> + skb_attempt_defer_free(skb);
> }
>
> return ret;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 6be01454f262..89217c43c639 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
> DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
> DEBUG_NET_WARN_ON_ONCE(skb->destructor);
> DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
> + DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
I may have missed something, but it looks like there's no guarantee that
the packet sent to TAP is not shared.
>
> sdn = per_cpu_ptr(net_hotdata.skb_defer_nodes, cpu) + numa_node_id();
>
> @@ -7221,6 +7222,7 @@ nodefer: kfree_skb_napi_cache(skb);
> if (unlikely(kick))
> kick_defer_list_purge(cpu);
> }
> +EXPORT_SYMBOL(skb_attempt_defer_free);
>
> static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
> size_t offset, size_t len)
> --
> 2.34.1
>
Thanks