Message-ID: <7a91042b-4406-4b99-99c5-6ec1ec7b98d7@intel.com>
Date: Wed, 21 Aug 2024 15:16:51 +0200
From: Alexander Lobakin <aleksander.lobakin@...el.com>
To: Daniel Xu <dxu@...uu.xyz>
CC: Jesper Dangaard Brouer <hawk@...nel.org>,
Toke Høiland-Jørgensen <toke@...hat.com>, "Lorenzo
Bianconi" <lorenzo.bianconi@...hat.com>, Alexander Lobakin
<alexandr.lobakin@...el.com>, Alexei Starovoitov <ast@...nel.org>, "Daniel
Borkmann" <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>, Larysa
Zaremba <larysa.zaremba@...el.com>, Michal Swiatkowski
<michal.swiatkowski@...ux.intel.com>, Björn Töpel
<bjorn@...nel.org>, Magnus Karlsson <magnus.karlsson@...el.com>, "Maciej
Fijalkowski" <maciej.fijalkowski@...el.com>, Jonathan Lemon
<jonathan.lemon@...il.com>, Lorenzo Bianconi <lorenzo@...nel.org>, "David
Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, "Jakub
Kicinski" <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, John Fastabend
<john.fastabend@...il.com>, Yajun Deng <yajun.deng@...ux.dev>, "Willem de
Bruijn" <willemb@...gle.com>, "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<xdp-hints@...-project.net>
Subject: Re: [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to
GRO from netif_receive_skb_list()
From: Daniel Xu <dxu@...uu.xyz>
Date: Tue, 20 Aug 2024 17:29:45 -0700
> Hi Olek,
>
> On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
> [..]
>>> Thanks A LOT for doing this benchmarking!
>>
>> I optimized the code a bit and picked my old patches for bulk NAPI skb
>> cache allocation and today I got 4.7 Mpps 🎉
>> IOW, the result of the series (7 patches in total, but 2 are not
>> networking-related) is 2.7 -> 4.7 Mpps == 75%!
>>
>> Daniel,
>>
>> if you want, you can pick my tree[0], either full or just up to
>>
>> "bpf: cpumap: switch to napi_skb_cache_get_bulk()"
>>
>> (13 patches total: 6 for netdev_feature_t and 7 for the cpumap)
>>
>> and test with your usecases. Would be nice to see some real world
>> results, not my synthetic tests :D
>>
>>> --Jesper
>>
>> [0]
>> https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/
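For context: the cpumap part of the tree conceptually replaces the
netif_receive_skb_list() call in the cpumap kthread with GRO, plus bulk
skb allocation from the per-CPU NAPI cache (that's what the
napi_skb_cache_get_bulk() patch mentioned above is about). Below is a
heavily simplified, hypothetical sketch of the GRO side -- not the
actual patch code, and the ->napi member of &bpf_cpu_map_entry is an
assumption made only for illustration:

/* Hypothetical sketch: feed the skbs built by the cpumap kthread into
 * GRO instead of netif_receive_skb_list(), so e.g. TCP segments can be
 * coalesced before entering the stack. 'rcpu->napi' is assumed here.
 */
static void cpu_map_gro_receive(struct bpf_cpu_map_entry *rcpu,
				struct sk_buff **skbs, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		napi_gro_receive(&rcpu->napi, skbs[i]);

	/* packets held by GRO are flushed when the NAPI poll completes */
}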
>
> So it turns out keeping the workload in place while I update and reboot
> the kernel is a Hard Problem. I'll put in some more effort and see if I
> can get one of the workloads to stay still, but it'll be a somewhat
> noisy test even if it works. So the following are synthetic tests
> (neper) but on a real prod setup as far as container networking and
> configuration is concerned.
>
> I cherry-picked 586be610~1..ca22ac8e9de onto our 6.9-ish branch. Had to
> skip some of the flag refactors b/c of conflicts - I didn't know the
> code well enough to do fixups. So I had to apply this diff (FWIW not sure
> the struct_size() here was right anyhow):
>
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index 089d19c62efe..359fbfaa43eb 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -110,7 +110,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> if (!cmap->cpu_map)
> goto free_cmap;
>
> - dev = bpf_map_area_alloc(struct_size(dev, priv, 0), NUMA_NO_NODE);
> + dev = bpf_map_area_alloc(sizeof(*dev), NUMA_NO_NODE);
Hmm, it will allocate the same amount of memory. Why do you need this?
Are you running these patches on some older kernel which doesn't have a
proper flex array at the end of &net_device?
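Just to illustrate with a minimal made-up struct (not the real
&net_device): when the struct ends with a flexible array member,
struct_size(dev, priv, 0) and sizeof(*dev) evaluate to the same size,
so the diff above doesn't change the allocation:

#include <linux/overflow.h>	/* struct_size() */
#include <linux/types.h>

struct example_dev {
	unsigned long	state;
	u8		priv[];	/* flexible array member: adds 0 bytes to sizeof */
};

/*
 * For a struct example_dev *dev:
 *
 *	struct_size(dev, priv, 0) == sizeof(*dev) + 0 * sizeof(dev->priv[0])
 *				  == sizeof(*dev)
 *
 * so both variants request the same size from bpf_map_area_alloc().
 */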
> if (!dev)
> goto free_cpu_map;
>
>
> ==== Baseline ===
> ./tcp_rr -c -H $SERVER -p 50,90,99 -T4 -F8 -l30
> ./tcp_stream -c -H $SERVER -T8 -F16 -l30
>
>            Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
> Run 1      2578189       0.00008831       0.00010623       0.00013439       15427.22
> Run 2      2657923       0.00008575       0.00010239       0.00012927       15272.12
> Run 3      2700402       0.00008447       0.00010111       0.00013183       14871.35
> Run 4      2571739       0.00008575       0.00011519       0.00013823       15344.72
> Run 5      2476427       0.00008703       0.00013055       0.00016895       15193.2
> Average    2596936       0.000086262      0.000111094      0.000140534      15221.722
>
> === cpumap NAPI patches ===
>            Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
> Run 1      2554598       0.00008703       0.00011263       0.00013055       17090.29
> Run 2      2478905       0.00009087       0.00011391       0.00014463       16742.27
> Run 3      2418599       0.00009471       0.00011007       0.00014207       17555.3
> Run 4      2562463       0.00008959       0.00010367       0.00013055       17892.3
> Run 5      2716551       0.00008127       0.00010879       0.00013439       17578.32
> Average    2546223.2     0.000088694      0.000109814      0.000136438      17371.696
> Delta      -1.95%        2.82%            -1.15%           -2.91%           14.12%
>
>
> So it looks like the GRO patches work quite well out of the box. It's
> curious that tcp_rr transactions go down a bit, though. I don't have any
> intuition around that.
14% is quite nice, I'd say. Was the first (baseline) table also measured
through the cpumap, or was that direct Rx?
>
> Lemme know if you wanna change some stuff and get a rerun.
>
> Thanks,
> Daniel
Thanks,
Olek