Message-ID: <681b5f5983645_1e4406294c4@willemb.c.googlers.com.notmuch>
Date: Wed, 07 May 2025 09:25:45 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Jon Kohler <jon@...anix.com>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: "ast@...nel.org" <ast@...nel.org>,
"daniel@...earbox.net" <daniel@...earbox.net>,
"davem@...emloft.net" <davem@...emloft.net>,
"kuba@...nel.org" <kuba@...nel.org>,
"hawk@...nel.org" <hawk@...nel.org>,
"john.fastabend@...il.com" <john.fastabend@...il.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
"aleksander.lobakin@...el.com" <aleksander.lobakin@...el.com>,
Jason Wang <jasowang@...hat.com>,
"Michael S. Tsirkin" <mst@...hat.com>
Subject: Re: [PATCH net-next 0/4] tun: optimize SKB allocation with NAPI cache
Jon Kohler wrote:
>
>
> > On May 6, 2025, at 10:54 AM, Willem de Bruijn <willemdebruijn.kernel@...il.com> wrote:
> >
> > Jon Kohler wrote:
> >> Use the per-CPU NAPI cache for SKB allocation, leveraging bulk
> >> allocation since the batch size is known at submission time. This
> >> improves efficiency by reducing allocation overhead, particularly when
> >> using IFF_NAPI and GRO, which can replenish the cache in a tight loop.
> >
> > Do you have experimental data?
>
> Yes! Sorry, I forgot to paste it into the cover letter. For the GRO case, I
> turned TSO off in the guest, which with iperf3 + TCP pushes all of the
> traffic down the tun_xdp_one() path, so we get good batching, and GRO
> aggregates the payloads back up again.
>
> cmds:
> ethtool -K eth0 tso off
> taskset -c 2 iperf3 -c other-vm-here -t 30 -p 5200 --bind local-add-here --cport 4200 -b 0 -i 30
>
> Before this series: ~14.4 Gbits/sec
>
> After this series: ~15.2 Gbits/sec
>
> So, roughly a 5% speedup in that case.
>
> In the UDP case (same syntax, just add -u), there isn't any GRO, but we
> do get a wee bump of about 1% on pure TX.
>
> In mixed TX/RX, where the cache gets fed from tun_do_read, we get a bit
> of a bump too, since the batch allocation doesn't need to work as hard.
>
> On the pure RX side, there is a bit of a benefit in that path because of
> the bulk deallocation, so it seems to be a net win all around from what
> I've seen thus far.
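> Roughly, the recycling side I mean looks like this simplified sketch
> (illustrative only, not the actual patch code; napi_consume_skb() is
> just the existing helper I have in mind for feeding the per-CPU cache):
>
> 	#include <linux/skbuff.h>
>
> 	/* Read/consume path: return the skb to the per-CPU NAPI cache
> 	 * rather than doing a full kfree_skb(), so later bulk
> 	 * allocations find warm entries instead of going to the slab
> 	 * allocator cold.
> 	 */
> 	static void tun_consume_skb_sketch(struct sk_buff *skb, int budget)
> 	{
> 		/* A non-zero budget tells the core we are in (or close
> 		 * to) NAPI context, which lets the skb struct be cached
> 		 * per CPU instead of freed immediately.
> 		 */
> 		napi_consume_skb(skb, budget);
> 	}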
>
> Happy to grab more details if there are other aspects you’re curious about.
Thanks! No, this is great. Let's definitely capture this in the
relevant patch or commit message. I'll take a closer look at the
implementation now.
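For anyone skimming the thread, the submission-side batching being
described is roughly the sketch below (simplified, not the patch code;
kmem_cache_alloc_bulk() is just one way to express it, and skb_cache
stands in for the skbuff head cache):

	#include <linux/skbuff.h>
	#include <linux/slab.h>

	/* tun_sendmsg()/tun_xdp_one() know the batch size up front, so
	 * the skb heads can be grabbed in one bulk slab call instead of
	 * one allocation per packet.
	 */
	static int tun_alloc_skb_batch_sketch(struct kmem_cache *skb_cache,
					      void **heads, unsigned int n)
	{
		/* Returns n on success or 0 on failure; on failure the
		 * caller can fall back to allocating skbs one at a time.
		 */
		return kmem_cache_alloc_bulk(skb_cache, GFP_KERNEL, n, heads);
	}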
> Note: In both the non-GSO TCP and UDP cases, we'd get even more of a
> bump if we could address the overhead of vhost's get_tx_bufs(), which is
> ~37% per the flame graph. Adding Jason/Michael as FYI on that. I suspect
> we could separately do some sort of batched reads there, which would give
> this series even more room to scream.
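Purely to illustrate the shape of the batched-read idea (every name in
this sketch is made up; none of it is existing vhost API):

	#include <linux/types.h>

	#define TX_FETCH_BATCH	64

	struct my_tx_buf {
		void *data;
		size_t len;
	};

	struct my_vq;			/* stand-in for the virtqueue */

	bool my_vq_has_work(struct my_vq *vq);
	int my_get_tx_buf(struct my_vq *vq, struct my_tx_buf *buf);

	/* Fetch up to TX_FETCH_BATCH tx buffers in one pass, then hand
	 * the whole array to the tun sendmsg path, instead of paying the
	 * full descriptor-fetch cost once per packet.
	 */
	static int fetch_tx_batch_sketch(struct my_vq *vq,
					 struct my_tx_buf *bufs)
	{
		int n = 0;

		while (n < TX_FETCH_BATCH && my_vq_has_work(vq)) {
			if (my_get_tx_buf(vq, &bufs[n]) < 0)
				break;
			n++;
		}

		/* Caller submits bufs[0..n) to tun as one batch. */
		return n;
	}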