Message-ID: <c596dff4-1e8b-4184-8eb6-590b4da2d92a@intel.com>
Date: Mon, 19 Aug 2024 16:50:52 +0200
From: Alexander Lobakin <aleksander.lobakin@...el.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>,
	Toke Høiland-Jørgensen <toke@...hat.com>,
	Lorenzo Bianconi <lorenzo.bianconi@...hat.com>, Daniel Xu <dxu@...uu.xyz>
CC: Alexander Lobakin <alexandr.lobakin@...el.com>,
	Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
	Andrii Nakryiko <andrii@...nel.org>, Larysa Zaremba <larysa.zaremba@...el.com>,
	Michal Swiatkowski <michal.swiatkowski@...ux.intel.com>,
	Björn Töpel <bjorn@...nel.org>, Magnus Karlsson <magnus.karlsson@...el.com>,
	Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
	Jonathan Lemon <jonathan.lemon@...il.com>, Lorenzo Bianconi <lorenzo@...nel.org>,
	David Miller <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	John Fastabend <john.fastabend@...il.com>, Yajun Deng <yajun.deng@...ux.dev>,
	Willem de Bruijn <willemb@...gle.com>, <bpf@...r.kernel.org>,
	<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<xdp-hints@...-project.net>
Subject: Re: [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to
 GRO from netif_receive_skb_list()

From: Jesper Dangaard Brouer <hawk@...nel.org>
Date: Tue, 13 Aug 2024 17:57:44 +0200

> 
> 
> On 13/08/2024 16.54, Toke Høiland-Jørgensen wrote:
>> Alexander Lobakin <aleksander.lobakin@...el.com> writes:
>>
>>> From: Alexander Lobakin <aleksander.lobakin@...el.com>
>>> Date: Thu, 8 Aug 2024 13:57:00 +0200
>>>
>>>> From: Lorenzo Bianconi <lorenzo.bianconi@...hat.com>
>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>>
>>>>>> Hi Alexander,

[...]

>>> I ran tests with a traffic generator on both the threaded NAPI for
>>> cpumap and my old implementation, and got the following (in Kpps):
>>>
> 
> What kind of traffic is the traffic generator sending?
> 
> E.g. is this a type of traffic that gets GRO aggregated?

Yes. It's UDP, with UDP GRO enabled on the receiver.
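In case it helps to reproduce: below is a minimal sketch of what I mean
by "UDP GRO enabled on the receiver", i.e. the receiving socket opts in
via the UDP_GRO sockopt. The helper name and the missing error reporting
are just for illustration, not my actual test rig.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef UDP_GRO
#define UDP_GRO		104	/* from include/uapi/linux/udp.h */
#endif

/* Open a UDP sink that asks the kernel for GRO-aggregated datagrams. */
static int open_udp_gro_sink(unsigned short port)
{
	struct sockaddr_in addr = {
		.sin_family	= AF_INET,
		.sin_addr	= { .s_addr = htonl(INADDR_ANY) },
		.sin_port	= htons(port),
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int one = 1;

	if (fd < 0)
		return -1;

	/* Opt in to UDP GRO: recvmsg() may now return aggregated payloads,
	 * with the segment size reported back via a UDP_GRO cmsg.
	 */
	if (setsockopt(fd, IPPROTO_UDP, UDP_GRO, &one, sizeof(one)) ||
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr))) {
		close(fd);
		return -1;
	}

	return fd;
}

IIRC, without that opt-in, locally delivered UDP flows don't get
aggregated at all, so the GRO gain wouldn't show up in the numbers.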

> 
>>>              direct Rx    direct GRO    cpumap    cpumap GRO
>>> baseline    2900         5800          2700      2700 (N/A)
>>> threaded                               2300      4000
>>> old GRO                                2300      4000
>>>
> 
> Nice results. Just to confirm, the units are in Kpps?

Yes. I.e. cpumap was giving 2.7 Mpps without GRO, then 4.0 Mpps with it.

> 
> 
>>> IOW,
>>>
>>> 1. There are no differences in perf between Lorenzo's threaded NAPI
>>>     GRO implementation and my old implementation, but Lorenzo's is also
>>>     a very nice cleanup as it switches cpumap to threaded NAPI completely
>>>     and the final diffstat even removes more lines than it adds, while
>>>     mine adds a bunch of lines and refactors a couple hundred, so I'd go
>>>     with his variant.
>>>
>>> 2. After switching to NAPI, the performance without GRO decreases (2.3
>>>     Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
>>>     (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
>>>     manually.
>>
>> One question for this: IIUC, the benefit of GRO varies with the traffic
>> mix, depending on how much the GRO logic can actually aggregate. So did
>> you test the pathological case as well (spraying packets over so many
>> flows that there is basically no aggregation taking place)? Just to make
>> sure we don't accidentally screw up performance in that case while
>> optimising for the aggregating case :)
>>
> 
> For the GRO use-case, I think a basic TCP stream throughput test (like
> netperf) should show a benefit once cpumap enables GRO. Can you confirm
> this?

Yes, TCP benefits as well.

> Or do the missing hardware RX-hash and RX-checksum cause TCP GRO not
> to fully work yet?

GRO works well for both TCP and UDP. The main bottleneck is that GRO
now has to calculate checksums on the CPU, since there's no checksum
status from the NIC.
Also, the missing Rx hash means GRO places packets from every flow into
the same bucket, but that's not a big deal (flows are still compared
layer by layer anyway).
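To illustrate what that means in code (purely a sketch with a made-up
function name, not anything from the series): an skb built from an
xdp_frame carries no HW metadata at all, so the GRO path roughly sees
this:

#include <linux/skbuff.h>
#include <net/gro.h>

static void cpumap_gro_rx_sketch(struct napi_struct *napi,
				 struct sk_buff *skb)
{
	/* No NIC filled in the csum status, so GRO has to verify the
	 * checksum on the CPU while validating the TCP/UDP header.
	 */
	skb->ip_summed = CHECKSUM_NONE;

	/* skb->hash is unset, so dev_gro_receive() drops every flow into
	 * the same GRO hash bucket; flows are still told apart by the
	 * per-layer header comparisons.
	 */
	napi_gro_receive(napi, skb);
}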

> 
> Thanks A LOT for doing this benchmarking!

I optimized the code a bit and picked up my old patches for bulk NAPI
skb cache allocation, and today I got 4.7 Mpps 🎉
IOW, the result of the series (7 patches in total, but 2 are not
networking-related) is 2.7 -> 4.7 Mpps, i.e. roughly +75%!
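For context, the rough idea of the bulk allocation looks like the
fragment below. Treat it as an illustration only: the
napi_skb_cache_get_bulk() signature and the loop around it (incl. @napi,
@frames and @nframes) are my sketch here, not the actual diff.

	void *skbs[CPUMAP_BATCH];
	u32 i, n;

	/* One call refills the whole batch from the per-CPU NAPI skb cache
	 * instead of allocating each skb head separately.
	 */
	n = napi_skb_cache_get_bulk(skbs, nframes);

	for (i = 0; i < n; i++) {
		struct xdp_frame *xdpf = frames[i];
		struct sk_buff *skb;

		skb = __xdp_build_skb_from_frame(xdpf, skbs[i], xdpf->dev_rx);
		if (skb)
			napi_gro_receive(napi, skb);
	}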

Daniel,

if you want, you can pick up my tree[0], either in full or just up to

"bpf: cpumap: switch to napi_skb_cache_get_bulk()"

(13 patches total: 6 for netdev_feature_t and 7 for the cpumap)

and test it with your use cases. It would be nice to see some real-world
results, not just my synthetic tests :D

> --Jesper

[0]
https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/

Thanks,
Olek
