Message-ID: <d34bf5f5-9626-442d-bdd2-b3ada51d556e@kernel.org>
Date: Fri, 15 Aug 2025 18:40:21 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Jason Xing <kerneljasonxing@...il.com>,
Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
pabeni@...hat.com, bjorn@...nel.org, magnus.karlsson@...el.com,
jonathan.lemon@...il.com, sdf@...ichev.me, ast@...nel.org,
daniel@...earbox.net, john.fastabend@...il.com, horms@...nel.org,
andrew+netdev@...n.ch, bpf@...r.kernel.org, netdev@...r.kernel.org,
Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode
On 13/08/2025 15.06, Jason Xing wrote:
> On Wed, Aug 13, 2025 at 9:02 AM Jason Xing <kerneljasonxing@...il.com> wrote:
>>
>> On Wed, Aug 13, 2025 at 1:49 AM Maciej Fijalkowski
>> <maciej.fijalkowski@...el.com> wrote:
>>>
>>> On Tue, Aug 12, 2025 at 04:30:03PM +0200, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 11/08/2025 15.12, Jason Xing wrote:
>>>>> From: Jason Xing <kernelxing@...cent.com>
>>>>>
>>>>> Zerocopy mode has a useful multi-buffer feature, while copy mode has
>>>>> to transmit skbs one by one like normal flows. The latter loses some
>>>>> of its bypass advantage because it grabs/releases the same tx queue
>>>>> lock and disables/enables bh on a per-packet basis. Contention on
>>>>> that queue lock makes the result even worse.
>>>>>
>>>>
>>>> I actually think that it is worth optimizing the non-zerocopy mode for
>>>> AF_XDP. My use-case was virtual net_devices like veth.
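For context, the per-packet cost being described lives in
__dev_direct_xmit() (net/core/dev.c), which the copy-mode path calls
once per skb. Lightly paraphrased from my reading of v6.16, it looks
like this:

int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
{
	struct net_device *dev = skb->dev;
	struct sk_buff *orig_skb = skb;
	struct netdev_queue *txq;
	int ret = NETDEV_TX_BUSY;
	bool again = false;

	/* Per-packet device state check */
	if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev)))
		goto drop;

	/* Per-packet validation (GSO/checksum handling etc.) */
	skb = validate_xmit_skb_list(skb, dev, &again);
	if (skb != orig_skb)
		goto drop;

	skb_set_queue_mapping(skb, queue_id);
	txq = skb_get_tx_queue(dev, skb);

	/* Per-packet bh-disable, recursion accounting and queue lock */
	local_bh_disable();
	dev_xmit_recursion_inc();
	HARD_TX_LOCK(dev, txq, smp_processor_id());
	if (!netif_xmit_frozen_or_drv_stopped(txq))
		ret = netdev_start_xmit(skb, dev, txq, false);
	HARD_TX_UNLOCK(dev, txq);
	dev_xmit_recursion_dec();
	local_bh_enable();

	return ret;
drop:
	dev_core_stats_tx_dropped_inc(dev);
	kfree_skb_list(skb);
	return NET_XMIT_DROP;
}

Everything from local_bh_disable() to local_bh_enable() is repeated for
every single packet today, which is the overhead the batching targets.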
>>>>
>>>>
>>>>> This patch adds a batch feature: the queue lock is held while sending
>>>>> generic_xmit_batch packets at one time. To further improve the result,
>>>>> some code[1] is removed on purpose from xsk_direct_xmit_batch()
>>>>> compared to __dev_direct_xmit().
>>>>>
>>>>> [1]
>>>>> 1. advance the device check to the granularity of the sendto syscall.
>>>>> 2. remove packet validation because it is not needed here.
>>>>> 3. remove the softnet_data.xmit.recursion handling because it's not
>>>>> necessary.
>>>>> 4. remove BQL flow control. We don't need BQL here because it would
>>>>> probably limit the speed. An ideal scenario is a standalone, clean tx
>>>>> queue used to send packets only for xsk; less contention gives better
>>>>> performance results.
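Just to check that I read the above correctly: a rough sketch of such a
batched xmit under a single queue-lock acquisition is below. This is
NOT the actual patch, only my reading of the description; the function
signature and argument layout are my own assumptions:

/* Hypothetical sketch (not the actual patch): push a batch of
 * already-built skbs to one tx queue while taking the queue lock and
 * disabling bh only once. Device/carrier checks are assumed to have
 * been done earlier, at sendto() granularity, as item 1 above says.
 */
static netdev_tx_t xsk_direct_xmit_batch(struct sk_buff **skbs, int n,
					 struct net_device *dev,
					 struct netdev_queue *txq)
{
	netdev_tx_t ret = NETDEV_TX_OK;
	int i;

	local_bh_disable();
	HARD_TX_LOCK(dev, txq, smp_processor_id());

	for (i = 0; i < n; i++) {
		if (netif_xmit_frozen_or_drv_stopped(txq)) {
			ret = NETDEV_TX_BUSY;
			break;
		}
		/* Last argument is the xmit_more hint; the 'more'
		 * discussion further down in this mail is about passing
		 * "i + 1 < n" here instead of false.
		 */
		ret = netdev_start_xmit(skbs[i], dev, txq, false);
		if (ret != NETDEV_TX_OK)
			break;
	}

	HARD_TX_UNLOCK(dev, txq);
	local_bh_enable();

	return ret;
}

If that is roughly what the patch does, the four removals listed above
are exactly the deltas against __dev_direct_xmit().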
>>>>>
>>>>> Experiments:
>>>>> 1) Tested on virtio_net:
>>>>
>>>> If you also want to test on veth, then an optimization is to increase
>>>> dev->needed_headroom to XDP_PACKET_HEADROOM (256), as this avoids non-zc
>>>> AF_XDP packets getting reallocated by the veth driver. I never completed
>>>> upstreaming this[1] before I left Red Hat. (virtio_net might also benefit)
>>>>
>>>> [1] https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_benchmark04.org
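In code terms, the tweak in [1] boils down to a one-liner during veth
device setup (untested sketch; the exact placement in
drivers/net/veth.c may differ):

/* Untested sketch of the idea in [1]: have veth request XDP headroom
 * up front, so non-zc AF_XDP skbs do not need to be reallocated when
 * they hit the veth XDP path.
 */
dev->needed_headroom = XDP_PACKET_HEADROOM;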
>>>>
>>>>
>>>> (more below...)
>>>>
>>>>> With this patch series applied, the performance number of xdpsock[2]
>>>>> goes up by 33%: before, it was 767743 pps; after, it was 1021486 pps.
>>>>> If we test with another thread competing for the same queue, a 28%
>>>>> increase (from 405466 pps to 521076 pps) can be observed.
>>>>> 2) Tested on ixgbe:
>>>>> The results of zerocopy and copy mode are respectively 1303277 pps and
>>>>> 1187347 pps. After this socket option took effect, copy mode reached
>>>>> 1472367 pps, which, impressively, is higher than zerocopy mode.
>>>>>
>>>>> [2]: ./xdpsock -i eth1 -t -S -s 64
>>>>>
>>>>> It's worth mentioning that batch processing might add latency in
>>>>> certain cases. The recommended value is 32.
>>>
>>> Given the issue I spotted on your ixgbe batching patch, the comparison
>>> against zc performance is probably not reliable.
>>
>> I have to clarify one thing: the zc performance was tested without that
>> series applied. That means without that series, the number is 1303277
>> pps. What I used is './xdpsock -i enp2s0f0np0 -t -q 11 -z -s 64'.
>
My i40e device is running at 40 Gbit/s.
I see significantly higher packets per sec (pps) than you are reporting:
$ sudo ./xdpsock -i i40e2 --txonly -q 2 -z -s 64
sock0@...e2:2 txonly xdp-drv
                 pps            pkts           1.00
rx               0              0
tx               21,546,859     21,552,896
The "copy" mode (-c/--copy) looks like this:
$ sudo ./xdpsock -i i40e2 --txonly -q 2 --copy -s 64
sock0@...e2:2 txonly xdp-drv
                 pps            pkts           1.00
rx               0              0
tx               2,135,424      2,136,192
The skb-mode (-S, --xdp-skb) looks like this:
$ sudo ./xdpsock -i i40e2 --txonly -q 2 --xdp-skb -s 64
sock0@...e2:2 txonly xdp-skb
                 pps            pkts           1.00
rx               0              0
tx               2,187,992      2,188,800
The HUGE performance gap to "xdp-drv" zero-copy mode tells me that there
is huge potential for improving the performance of copy mode, in both
the "native" xdp-drv and the xdp-skb cases.
Thus, the work you are doing here is important.
> ------
> @Maciej Fijalkowski
> The interesting thing is that copy mode is far worse than zerocopy if
> the NIC is _i40e_.
>
> With ixgbe, even plain copy mode reaches nearly 50-60% of the full
> speed (which is only 1Gb/sec), while both zc and the batch version of
> copy mode reach 70%.
> With i40e, copy mode only reaches nearly 9% while zc mode reaches 70%.
> Comparatively, i40e has a much faster link speed (10Gb/sec).
>
> Here are some summaries (budget 32, batch 32):
>              copy          batch         zc
> i40e         1,777,859     2,102,579     14,880,643
> ixgbe        1,187,347     1,472,367     1,303,277
>
(added thousands separators to make the above readable)
Those numbers only make sense if
i40e runs at link speed 10 Gbit/s and
ixgbe runs at link speed 1 Gbit/s.
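Quick wire-speed math for 64-byte packets: each frame occupies 64 + 20
bytes (preamble/SFD + inter-frame gap) = 84 bytes = 672 bits on the
wire, so 1 Gbit/s tops out at ~1.488 Mpps and 10 Gbit/s at ~14.88 Mpps.
The 14,880,643 pps zc number on i40e is essentially 10G wire-speed, and
the 1,472,367 pps batch number on ixgbe is essentially 1G wire-speed.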
> For i40e, here are the numbers around batch copy mode (budget is 128):
>              no batch     batch 64
>              1825027      2228328
> Side note: 2228328 seems to be the max limit in copy mode with this
> series applied, after testing with different settings.
>
> It turns out that testing on i40e is definitely needed, because the
> xdpsock test hits a bottleneck on ixgbe.
>
> -----
> @Jesper Dangaard Brouer
> In terms of the 'more' boolean as Jesper said, related drivers might
> need changes like this because, in the 'for' loop of the batch process
> in xsk_direct_xmit_batch(), a driver may fail to send and then break
> out of the 'for' loop, which leaves no chance to kick the hardware.
If sending with the 'more' indicator and the driver fails to send, then
it is the responsibility of the driver to update the tail-ptr/doorbell.
Example from the ixgbe driver:
https://elixir.bootlin.com/linux/v6.16/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L8879-L8880
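From memory, the lines behind that link look roughly like this:

/* Roughly what ixgbe_tx_map() does after queuing a frame; 'i' is the
 * ring's updated next_to_use index. The tail/doorbell write is only
 * skipped when xmit_more promises another frame and the queue is not
 * stopped, so an error path that queued earlier frames with 'more'
 * must ring the doorbell itself.
 */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !netdev_xmit_more())
	writel(i, tx_ring->tail);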
> Or we could keep trying to send in xsk_direct_xmit_batch() instead of
> breaking out immediately when the driver becomes busy.
>
> As to ixgbe, the performance doesn't improve, as I analyzed above
> (because it already reaches 70% of full speed).
>
If ixgbe runs at 1 Gbit/s, then remember that Ethernet wire-speed for
64-byte frames is 1.488 Mpps. Thus, you are much closer to wire-speed
than 70%.
> As to i40e, with only the 'more' logic added, the number goes from
> 2102579 to 2585224 with the 32 batch and 32 budget settings. The
> number goes from 2200013 to 2697313 with the batch and 64 budget
> settings. See! A 22+% improvement!
That is a very big performance gain IMHO. It shows that avoiding the
tail-ptr/doorbell update between each packet has a huge benefit.
> As to virtio_net, there is no obvious change here, probably because
> the hardirq logic doesn't have a huge effect.
>
Perhaps virtio_net doesn't implement the SKB 'more' feature?
--Jesper