Message-ID: <3A17D540-12F6-46FE-8109-CCAEBC168754@nutanix.com>
Date: Tue, 6 May 2025 19:11:17 +0000
From: Jon Kohler <jon@...anix.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
CC: "ast@...nel.org" <ast@...nel.org>,
"daniel@...earbox.net"
<daniel@...earbox.net>,
"davem@...emloft.net" <davem@...emloft.net>,
"kuba@...nel.org" <kuba@...nel.org>,
"hawk@...nel.org" <hawk@...nel.org>,
"john.fastabend@...il.com" <john.fastabend@...il.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"bpf@...r.kernel.org"
<bpf@...r.kernel.org>,
"aleksander.lobakin@...el.com"
<aleksander.lobakin@...el.com>,
Jason Wang <jasowang@...hat.com>,
"Michael S.
Tsirkin" <mst@...hat.com>
Subject: Re: [PATCH net-next 0/4] tun: optimize SKB allocation with NAPI cache
> On May 6, 2025, at 10:54 AM, Willem de Bruijn <willemdebruijn.kernel@...il.com> wrote:
>
> Jon Kohler wrote:
>> Use the per-CPU NAPI cache for SKB allocation, leveraging bulk
>> allocation since the batch size is known at submission time. This
>> improves efficiency by reducing allocation overhead, particularly when
>> using IFF_NAPI and GRO, which can replenish the cache in a tight loop.
>
> Do you have experimental data?
Yes! Sorry, I missed pasting it into the cover letter. For the GRO case, I
turned TSO off in the guest, which with iperf3 + TCP pushes all of the
traffic down the tun_xdp_one() path, so we get good batching, and GRO
aggregates the payloads back up again.
cmds:
ethtool -K eth0 tso off
taskset -c 2 iperf3 -c other-vm-here -t 30 -p 5200 --bind local-add-here --cport 4200 -b 0 -i 30
Before this series: ~14.4 Gbits/sec
After this series: ~15.2 Gbits/sec
So about a ~5%-ish speedup in that case.
In the UDP case (same syntax, just add -u), there isn't any GRO, but we
do get a small bump of about ~1% on pure TX.

In mixed TX/RX, where the cache gets fed from tun_do_read, we get a bit
of a bump too, since the bulk allocation doesn't need to work as hard.

On the pure RX side there is also a benefit from the bulk deallocation,
so it looks like a net win all around from what I've seen so far.
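To make the RX/free side concrete, this is roughly the shape of the
change on the tun_do_read path (a simplified sketch, not the literal
diff; the "budget" plumbing is assumed/omitted here):

    /* Before: consume_skb() returns the skb head straight to the slab. */
    consume_skb(skb);

    /* After: napi_consume_skb() with a non-zero budget can park the skb
     * head in the per-CPU NAPI cache instead, so the allocation side
     * can recycle it without touching the slab. A budget of 0 falls
     * back to the regular free path.
     */
    napi_consume_skb(skb, budget);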
Happy to grab more details if there are other aspects you’re curious about.
Note: in both the TCP non-GSO case and the UDP case, we'd get an even
bigger bump if we could address the overhead of vhost's get_tx_bufs(),
which shows up as ~37% in the flame graph. Adding Jason/Michael as FYI
on that. I suspect we could separately do some sort of batched reads
there, which would give this series even more room to scream.
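For reference, the allocation side in tun_xdp_one() conceptually ends up
along these lines (again a simplified sketch, not the literal patch;
error handling, headroom math, and the up-front bulk fill of the cache
for the known batch size are omitted):

    struct sk_buff *skb;

    /* build_skb() always takes a fresh skbuff head from the slab;
     * napi_build_skb() tries the per-CPU NAPI cache first, which the
     * free side (napi_consume_skb) keeps topped up, so a tight
     * batch/GRO loop mostly recycles the same heads.
     */
    skb = napi_build_skb(xdp->data_hard_start, buflen);
    if (!skb)
        return -ENOMEM;

    skb_reserve(skb, xdp->data - xdp->data_hard_start);
    skb_put(skb, xdp->data_end - xdp->data);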
>
>> Additionally, utilize napi_build_skb and napi_consume_skb to further
>> benefit from the NAPI cache.
>>
>> Note: This series does not address the large payload path in
>> tun_alloc_skb, which spans sock.c and skbuff.c. A separate series will
>> handle privatizing the allocation code in tun and integrating the NAPI
>> cache for that path.
>>
>> Thanks all,
>> Jon
>>
>> Jon Kohler (4):
>> tun: rcu_deference xdp_prog only once per batch
>> tun: optimize skb allocation in tun_xdp_one
>> tun: use napi_build_skb in __tun_build_skb
>> tun: use napi_consume_skb in tun_do_read
>>
>> drivers/net/tun.c | 60 +++++++++++++++++++++++++++++++++--------------
>> 1 file changed, 42 insertions(+), 18 deletions(-)
>>
>> --
>> 2.43.0
>>
>
>