Message-ID: <3A17D540-12F6-46FE-8109-CCAEBC168754@nutanix.com>
Date: Tue, 6 May 2025 19:11:17 +0000
From: Jon Kohler <jon@...anix.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
CC: "ast@...nel.org" <ast@...nel.org>,
"daniel@...earbox.net"
<daniel@...earbox.net>,
"davem@...emloft.net" <davem@...emloft.net>,
"kuba@...nel.org" <kuba@...nel.org>,
"hawk@...nel.org" <hawk@...nel.org>,
"john.fastabend@...il.com" <john.fastabend@...il.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"bpf@...r.kernel.org"
<bpf@...r.kernel.org>,
"aleksander.lobakin@...el.com"
<aleksander.lobakin@...el.com>,
Jason Wang <jasowang@...hat.com>,
"Michael S.
Tsirkin" <mst@...hat.com>
Subject: Re: [PATCH net-next 0/4] tun: optimize SKB allocation with NAPI cache
> On May 6, 2025, at 10:54 AM, Willem de Bruijn <willemdebruijn.kernel@...il.com> wrote:
>
> Jon Kohler wrote:
>> Use the per-CPU NAPI cache for SKB allocation, leveraging bulk
>> allocation since the batch size is known at submission time. This
>> improves efficiency by reducing allocation overhead, particularly when
>> using IFF_NAPI and GRO, which can replenish the cache in a tight loop.
>
> Do you have experimental data?
Yes! Sorry, I missed pasting it into the cover letter. For the GRO case, I
turned TSO off in the guest, which with iperf3 + TCP pushes all of the
traffic down the tun_xdp_one() path, so we get good batching, and GRO
aggregates the payloads back up again.
cmds:
ethtool -K eth0 tso off
taskset -c 2 iperf3 -c other-vm-here -t 30 -p 5200 --bind local-add-here --cport 4200 -b 0 -i 30
Before this series: ~14.4 Gbits/sec
After this series: ~15.2 Gbits/sec
So about a ~5%-ish speedup in that case.
In the UDP case (same syntax, just add -u), there isn't any GRO, but we
do get a small bump of about ~1% on pure TX.

In mixed TX/RX, where the cache gets fed from tun_do_read, we get a bit
of a bump too, since the bulk allocation doesn't need to work as hard.

On the pure RX side there is also a benefit from the bulk deallocation,
so it looks like a net win all around from what I've seen so far.
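To make the RX/free side concrete, this is roughly the shape of the
change on the tun_do_read path (a simplified sketch, not the literal
diff; the "budget" plumbing is assumed/omitted here):

    /* Before: consume_skb() returns the skb head straight to the slab. */
    consume_skb(skb);

    /* After: napi_consume_skb() with a non-zero budget can park the skb
     * head in the per-CPU NAPI cache instead, so the allocation side
     * can recycle it without touching the slab. A budget of 0 falls
     * back to the regular free path.
     */
    napi_consume_skb(skb, budget);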
Happy to grab more details if there are other aspects you’re curious about.
Note: in both the TCP non-GSO case and the UDP case, we'd get an even
bigger bump if we could address the overhead of vhost's get_tx_bufs(),
which shows up as ~37% in the flame graph. Adding Jason/Michael as FYI
on that. I suspect we could separately do some sort of batched reads
there, which would give this series even more room to scream.
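For reference, the allocation side in tun_xdp_one() conceptually ends up
along these lines (again a simplified sketch, not the literal patch;
error handling, headroom math, and the up-front bulk fill of the cache
for the known batch size are omitted):

    struct sk_buff *skb;

    /* build_skb() always takes a fresh skbuff head from the slab;
     * napi_build_skb() tries the per-CPU NAPI cache first, which the
     * free side (napi_consume_skb) keeps topped up, so a tight
     * batch/GRO loop mostly recycles the same heads.
     */
    skb = napi_build_skb(xdp->data_hard_start, buflen);
    if (!skb)
        return -ENOMEM;

    skb_reserve(skb, xdp->data - xdp->data_hard_start);
    skb_put(skb, xdp->data_end - xdp->data);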
>
>> Additionally, utilize napi_build_skb and napi_consume_skb to further
>> benefit from the NAPI cache.
>>
>> Note: This series does not address the large payload path in
>> tun_alloc_skb, which spans sock.c and skbuff.c. A separate series will
>> handle privatizing the allocation code in tun and integrating the NAPI
>> cache for that path.
>>
>> Thanks all,
>> Jon
>>
>> Jon Kohler (4):
>> tun: rcu_deference xdp_prog only once per batch
>> tun: optimize skb allocation in tun_xdp_one
>> tun: use napi_build_skb in __tun_build_skb
>> tun: use napi_consume_skb in tun_do_read
>>
>> drivers/net/tun.c | 60 +++++++++++++++++++++++++++++++++--------------
>> 1 file changed, 42 insertions(+), 18 deletions(-)
>>
>> --
>> 2.43.0
>>
>
>