Message-ID: <CAL+tcoAst1xs=xCLykUoj1=Vj-0LtVyK-qrcDyoy4mQrHgW1kg@mail.gmail.com>
Date: Wed, 13 Aug 2025 21:06:13 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: Jesper Dangaard Brouer <hawk@...nel.org>, davem@...emloft.net, edumazet@...gle.com, 
	kuba@...nel.org, pabeni@...hat.com, bjorn@...nel.org, 
	magnus.karlsson@...el.com, jonathan.lemon@...il.com, sdf@...ichev.me, 
	ast@...nel.org, daniel@...earbox.net, john.fastabend@...il.com, 
	horms@...nel.org, andrew+netdev@...n.ch, bpf@...r.kernel.org, 
	netdev@...r.kernel.org, Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode

On Wed, Aug 13, 2025 at 9:02 AM Jason Xing <kerneljasonxing@...il.com> wrote:
>
> On Wed, Aug 13, 2025 at 1:49 AM Maciej Fijalkowski
> <maciej.fijalkowski@...el.com> wrote:
> >
> > On Tue, Aug 12, 2025 at 04:30:03PM +0200, Jesper Dangaard Brouer wrote:
> > >
> > >
> > > On 11/08/2025 15.12, Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@...cent.com>
> > > >
> > > > Zerocopy mode has a useful feature called multi-buffer, while copy mode
> > > > has to transmit skbs one by one like a normal flow. The latter loses some
> > > > of its bypass advantage because it grabs/releases the same tx queue lock
> > > > and disables/enables bh on a per-packet basis. Contention on the same
> > > > queue lock makes the result even worse.
> > > >
> > >
> > > I actually think that it is worth optimizing the non-zerocopy mode for
> > > AF_XDP.  My use-case was virtual net_devices like veth.
> > >
> > >
> > > > This patch adds the batch feature: the queue lock is taken once to send
> > > > up to generic_xmit_batch packets at a time. To further improve the result,
> > > > some code[1] is deliberately left out of xsk_direct_xmit_batch() compared
> > > > with __dev_direct_xmit(), which it is modeled on.
> > > >
> > > > [1]
> > > > 1. The device check is moved out of the per-packet path and done once
> > > >     per sendto syscall.
> > > > 2. Packet validation is removed because it is not needed in this path.
> > > > 3. The softnet_data.xmit.recursion handling is removed because it is
> > > >     not necessary here.
> > > > 4. BQL flow control is removed because it would probably limit the
> > > >     speed. The ideal scenario is a standalone, clean tx queue used only
> > > >     for xsk packets; less competition gives better performance.
> > > >
> > > > Experiments:
> > > > 1) Tested on virtio_net:
> > >
> > > If you also want to test on veth, then an optimization is to increase
> > > dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
> > > AF_XDP packets getting reallocated by the veth driver. I never completed
> > > upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
> > >
> > >  [1] https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
> > >
> > >
> > > (more below...)
> > >
> > > > With this patch series applied, the performance number of xdpsock[2] goes
> > > > up by 33%: it was 767743 pps before and 1021486 pps after. If we test
> > > > with another thread competing for the same queue, a 28% increase (from
> > > > 405466 pps to 521076 pps) can be observed.
> > > > 2) Tested on ixgbe:
> > > > The results of zerocopy and copy mode are 1303277 pps and 1187347 pps
> > > > respectively. After this socket option took effect, copy mode reached
> > > > 1472367 pps, impressively higher than zerocopy mode.
> > > >
> > > > [2]: ./xdpsock -i eth1 -t  -S -s 64
> > > >
> > > > It's worth mentioning that batch processing might bring higher latency
> > > > in certain cases. The recommended batch value is 32.
> >
> > Given the issue I spotted on your ixgbe batching patch, the comparison
> > against zc performance is probably not reliable.
>
> To clarify: zc performance was tested without that series applied, i.e.
> without that series the number is 1303277 pps. The command I used was
> './xdpsock -i enp2s0f0np0 -t -q 11 -z -s 64'.

------
@Maciej Fijalkowski
An interesting thing is that copy mode is far worse than zerocopy when the
nic is _i40e_.

With ixgbe, even plain copy mode reaches roughly 50-60% of the link speed
(which is only 1 Gb/s), while zc mode or the batch version of copy mode
reaches about 70%.
With i40e, copy mode reaches only about 9% while zc mode reaches about 70%;
i40e has a much faster link (10 Gb/s) by comparison.
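(For reference, these percentages are payload throughput at 64-byte frames:
e.g. 1777859 pps * 64 B * 8 bits is roughly 0.91 Gb/s, i.e. about 9% of a
10 Gb/s link.)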

Here are some summaries (budget 32, batch 32, numbers in pps):

           copy       batch      zc
i40e       1777859    2102579    14880643
ixgbe      1187347    1472367    1303277

For i40e, here are numbers for batch copy mode (budget 128, in pps):

no batch    batch 64
1825027     2228328

Side note: 2228328 pps seems to be the upper limit of copy mode with this
series applied, after testing with different settings.

It turns out that testing on i40e is definitely needed because the xdpsock
test is bottlenecked by the link speed on ixgbe.

-----
@Jesper Dangaard Brouer
Regarding the 'more' boolean Jesper mentioned, the related drivers might
need changes like this because, in the 'for' loop of the batch process in
xsk_direct_xmit_batch(), a driver may fail to send and break out of the
loop, which leaves no chance to kick the hardware. Alternatively, we could
keep retrying in xsk_direct_xmit_batch() instead of breaking immediately
when the driver is busy at that moment.
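
To make the structure concrete, here is a rough sketch (not the actual
patch; the function signature, the skbs array and the budget parameter are
only illustrative) of a batch loop modeled on __dev_direct_xmit() that
holds the queue lock once and passes an xmit_more-style hint:

/*
 * Rough sketch only: take the Tx queue lock once for the whole batch and
 * tell the driver whether more packets follow, so it can defer its
 * doorbell/tail update until the end of the batch.
 */
static int xsk_direct_xmit_batch(struct sk_buff **skbs, int budget,
				 struct net_device *dev,
				 struct netdev_queue *txq)
{
	int i, ret = NETDEV_TX_OK;

	local_bh_disable();
	HARD_TX_LOCK(dev, txq, smp_processor_id());
	for (i = 0; i < budget; i++) {
		if (netif_xmit_frozen_or_drv_stopped(txq)) {
			ret = NETDEV_TX_BUSY;
			break;
		}
		/* 'more' is true for all but the last packet of the batch. */
		ret = netdev_start_xmit(skbs[i], dev, txq, i + 1 < budget);
		/*
		 * If the driver is busy here, packets already queued with
		 * 'more' set have not been kicked yet -- exactly the case
		 * described above: either the driver kicks the hardware on
		 * NETDEV_TX_BUSY, or we retry instead of breaking.
		 */
		if (!dev_xmit_complete(ret))
			break;
	}
	HARD_TX_UNLOCK(dev, txq);
	local_bh_enable();

	return ret;
}

The sketch is only meant to show where the 'more' hint and the hardware
kick interact; the real series may structure this differently.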

As to ixgbe, the performance doesn't improve, as I analyzed above (because
it already reaches about 70% of the link speed).

As to i40e, with only the 'more' logic added, the number goes from 2102579
to 2585224 pps with the 32 batch and 32 budget settings, and from 2200013
to 2697313 pps with the  batch and 64 budget settings. That's a 22+%
improvement!

As to virtio_net, there is no obvious change here, probably because the
hardirq logic doesn't have a big effect.

Thanks,
Jason
