Message-ID: <CACGkMEt1xybppvu2W42qWfabbsvRdH=1iycoQBOxJ3-+frFW6Q@mail.gmail.com>
Date: Fri, 7 Nov 2025 10:21:53 +0800
From: Jason Wang <jasowang@...hat.com>
To: Nick Hudson <nhudson@...mai.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>, Andrew Lunn <andrew+netdev@...n.ch>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, 
	netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] tun: use skb_attempt_defer_free in tun_do_read

On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@...mai.com> wrote:
>
> On a 640-CPU system running virtio-net VMs with the vhost-net driver and
> multiqueue (64) tap devices, testing has shown contention on the zone
> lock of the page allocator.
>
> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows:
>
>     # perf report -i perf.data.vhost --stdio --sort overhead  --no-children | head -22
>     ...
>     #
>        100.00%
>                 |
>                 |--9.47%--queued_spin_lock_slowpath
>                 |          |
>                 |           --9.37%--_raw_spin_lock_irqsave
>                 |                     |
>                 |                     |--5.00%--__rmqueue_pcplist
>                 |                     |          get_page_from_freelist
>                 |                     |          __alloc_pages_noprof
>                 |                     |          |
>                 |                     |          |--3.34%--napi_alloc_skb
>     #
>
> That is, for Rx packets:
> - ksoftirqd threads, pinned 1:1 to CPUs, do the SKB allocation.
> - vhost-net threads, which float across CPUs, do the SKB free.
>
> One method to avoid this contention is to free SKBs on the same CPU they
> were allocated on. This allows freed pages to be placed on the per-CPU
> page (PCP) lists, so that new allocations can be taken directly from a
> PCP list rather than requesting fresh pages from the page allocator (and
> taking the zone lock).
>
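For context, the deferral pattern is roughly the following. This is a
minimal sketch, not the mainline implementation: the real code in
net/core/skbuff.c also handles NUMA placement, list-size limits, and its
own locking, and queue_on_defer_list()/kick_remote_cpu() below are
hypothetical helper names, not real kernel functions.

#include <linux/skbuff.h>
#include <linux/smp.h>

/* Sketch of the defer-to-allocating-CPU pattern behind
 * skb_attempt_defer_free(); simplified, see net/core/skbuff.c.
 */
static void attempt_defer_free_sketch(struct sk_buff *skb)
{
        int cpu = skb->alloc_cpu;

        /* Already on the allocating CPU (or that CPU is offline):
         * free now, so the pages return to this CPU's PCP list.
         */
        if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
                __kfree_skb(skb);
                return;
        }

        /* Otherwise queue the skb on the allocating CPU's defer list
         * and nudge that CPU; it frees the skb later, putting the
         * pages back on its own PCP list.
         */
        queue_on_defer_list(skb, cpu);  /* hypothetical helper */
        kick_remote_cpu(cpu);           /* hypothetical helper */
}
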
> Fortunately, previous work has provided all the infrastructure to do this
> via skb_attempt_defer_free(), which this change uses instead of
> consume_skb() in tun_do_read(). Since tun can be built as a module,
> skb_attempt_defer_free() is also exported.
>
> Testing was done with a 6.12-based kernel and the patch ported forward.
>
> Server: dual-socket AMD SP5 - 2x AMD SP5 9845 (Turin), running 2 VMs
> Load generator: iPerf2 with 1200 clients, MSS=400
>
> Before:
> Maximum traffic rate: 55 Gbps
>
> After:
> Maximum traffic rate: 110 Gbps
> ---
>  drivers/net/tun.c | 2 +-
>  net/core/skbuff.c | 2 ++
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 8192740357a0..388f3ffc6657 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>                 if (unlikely(ret < 0))
>                         kfree_skb(skb);
>                 else
> -                       consume_skb(skb);
> +                       skb_attempt_defer_free(skb);
>         }
>
>         return ret;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 6be01454f262..89217c43c639 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -7201,6 +7201,7 @@ nodefer:  kfree_skb_napi_cache(skb);
>         DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
>         DEBUG_NET_WARN_ON_ONCE(skb->destructor);
>         DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
> +       DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));

I may have missed something, but it looks like there's no guarantee that
the packet sent to TAP is not shared.
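
If that is the case, one defensive variant (a sketch only, not part of
the posted patch) would keep the old path for shared skbs, since the
defer path assumes it holds the last reference:

        /* Sketch: only defer when we hold the last reference. */
        if (unlikely(ret < 0))
                kfree_skb(skb);
        else if (skb_shared(skb))
                consume_skb(skb);       /* shared: drop our ref only */
        else
                skb_attempt_defer_free(skb);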

>
>         sdn = per_cpu_ptr(net_hotdata.skb_defer_nodes, cpu) + numa_node_id();
>
> @@ -7221,6 +7222,7 @@ nodefer:  kfree_skb_napi_cache(skb);
>         if (unlikely(kick))
>                 kick_defer_list_purge(cpu);
>  }
> +EXPORT_SYMBOL(skb_attempt_defer_free);
>
>  static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
>                                  size_t offset, size_t len)
> --
> 2.34.1
>

Thanks

