Message-ID: <25ujrqfgfkyek2mxh2c2kuuvyt5dyx2e6uysujgv3q43ezab4s@aedwgrlhnvft>
Date: Mon, 25 Nov 2024 14:53:37 -0700
From: Daniel Xu <dxu@...uu.xyz>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Alexander Lobakin <aleksander.lobakin@...el.com>, 
	Lorenzo Bianconi <lorenzo@...nel.org>, "bpf@...r.kernel.org" <bpf@...r.kernel.org>, 
	Jakub Kicinski <kuba@...nel.org>, Alexei Starovoitov <ast@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>, 
	John Fastabend <john.fastabend@...il.com>, Martin KaFai Lau <martin.lau@...ux.dev>, 
	David Miller <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org, 
	Lorenzo Bianconi <lorenzo.bianconi@...hat.com>, kernel-team <kernel-team@...udflare.com>, 
	mfleming@...udflare.com
Subject: Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase

Hi Jesper,

On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote:
> 
> 
> On 25/11/2024 16.12, Alexander Lobakin wrote:
> > From: Daniel Xu <dxu@...uu.xyz>
> > Date: Fri, 22 Nov 2024 17:10:06 -0700
> > 
> > > Hi Olek,
> > > 
> > > Here are the results.
> > > 
> > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> > > > 
> > > > 
> > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > 
> > [...]
> > 
> > > Baseline (again)
> > > 
> > >           Transactions   Latency P50 (s)   Latency P90 (s)   Latency P99 (s)   Throughput (Mbit/s)
> > > Run 1     3169917        0.00007295        0.00007871        0.00009343        21749.43
> > > Run 2     3228290        0.00007103        0.00007679        0.00009215        21897.17
> > > Run 3     3226746        0.00007231        0.00007871        0.00009087        21906.82
> > > Run 4     3191258        0.00007231        0.00007743        0.00009087        21155.15
> > > Run 5     3235653        0.00007231        0.00007743        0.00008703        21397.06
> > > Average   3210372.8      0.000072182       0.000077814       0.00009087        21621.126
> > > 
> 
> We need to talk about what we are measuring, and how to control the
> experiment setup to get reproducible results.
> Especially controlling which CPU cores our code paths are executing on.
> 
> In the above "baseline" case, we have two processes/tasks executing:
>  (1) RX-napi softirq/thread (until napi_gro_receive delivers to socket)
>  (2) Userspace netserver process TCP receiving data from socket.

"baseline" in this case is still cpumap, just without these GRO patches.

> 
> My experience is that you will see two noticeably different
> throughput performance results depending on whether (1) and (2) are
> executing on the *same* CPU (multi-tasking context-switching),
> or executing in parallel (e.g. pinned) on two different CPU cores.
> 
> The netperf command has an option
> 
>  -T lcpu,remcpu
>       Request that netperf be bound to local CPU lcpu and/or netserver be
> bound to remote CPU rcpu.
> 
> Verify the setting by listing the pinning like this:
>   for PID in $(pidof netserver); do taskset -pc $PID ; done
> 
> You can also set the pinning at runtime like this:
>  export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; done
> 
> For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
> output and adjust the pinning at runtime to observe the effect quickly.
> 
> My experience is unfortunately that TCP results have a lot of variation
> (thanks for including 5 runs in your benchmarks), as they depend on task
> timing, which can be affected by CPU sleep states. The system's CPU
> latency setting can be seen in /dev/cpu_dma_latency, which can be read
> like this:
> 
>  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
> 
> For playing with /dev/cpu_dma_latency I choose to use tuned-adm, as
> changing it requires holding the file open. E.g. I play with these profiles:
> 
>  sudo tuned-adm profile throughput-performance
>  sudo tuned-adm profile latency-performance
>  sudo tuned-adm profile network-latency

Appreciate the tips - I should keep this saved somewhere.
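
For my own notes, a rough recipe tying these together might look like the
following (the CPU numbers, $SERVER, and the TCP_STREAM test type are just
placeholders for my setup, not taken from your mail):

  # pin netserver (the receiver) to CPU 2 and verify
  export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; done
  for PID in $(pidof netserver); do taskset -pc $PID; done

  # keep CPUs out of deep sleep states while measuring
  sudo tuned-adm profile network-latency
  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency

  # run netperf pinned (local CPU 4, remote CPU 2) with 1 sec interim results
  netperf -H $SERVER -T 4,2 -D1 -t TCP_STREAM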

> 
> 
> > > cpumap v2 Olek
> > > 
> > >           Transactions   Latency P50 (s)   Latency P90 (s)   Latency P99 (s)   Throughput (Mbit/s)
> > > Run 1     3253651        0.00007167        0.00007807        0.00009343        13497.57
> > > Run 2     3221492        0.00007231        0.00007743        0.00009087        12115.53
> > > Run 3     3296453        0.00007039        0.00007807        0.00009087        12323.38
> > > Run 4     3254460        0.00007167        0.00007807        0.00009087        12901.88
> > > Run 5     3173327        0.00007295        0.00007871        0.00009215        12593.22
> > > Average   3239876.6      0.000071798       0.00007807        0.000091638       12686.316
> > > Delta     0.92%          -0.53%            0.33%             0.85%             -41.32%
> > > 
> > > 
> 
> 
> We now have three processes/tasks executing:
>  (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
>  (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
>  (3) Userspace netserver process TCP receiving data from socket.
> 
> Again, the performance is now going to depend on which CPU cores
> the processes/tasks are running on and whether some are sharing the
> same CPU. (There are both wakeup timing and cache-line effects).
> 
> There are now more combinations to test...
> 
> CPUmap is a CPU scaling facility, and you will likely also see different
> CPU utilization on the different cores once you start to pin these to
> control the scenarios.
> 
> > > It's very interesting that we see -40% tput w/ the patches. I went back
> > 
> 
> Sad that we see -40% throughput...  but do we know what CPU cores the
> three different tasks/processes now run on?
> 

Roughly, yes. For context, my primary use case for cpumap is to provide
some degree of isolation between colocated containers on a single host.
In particular, colocation occurs on AMD Bergamo, and containers are
CPU-pinned to their own CCX (roughly). My RX steering program ensures
RX packets destined for a specific container are cpumap-redirected to any
of the container's pinned CPUs. This not only provides a good measure of
isolation but also ensures resources are properly accounted for.

So to answer your question of which CPUs the 3 things run on: cpumap
kthread and application run on the same set of cores. More than that,
they share the same L3 cache by design. irq/softirq is effectively
random given default RSS config and IRQ affinities.
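
To make that a bit more concrete, here is roughly how I check where each of
the three actors lands while a test runs (the cpumap kthread name pattern and
eth0 are assumptions about my box, not something from this thread):

  # where does the userspace receiver (netserver) run? (psr = current CPU)
  for PID in $(pidof netserver); do ps -o pid,psr,comm -p $PID; done

  # where do the cpumap kthreads run?
  # (assuming they show up as cpumap/<cpu>/map:<id> as on my kernel)
  ps -eLo pid,psr,comm | grep -i cpumap

  # which CPUs take the NIC RX interrupts (and hence the RX softirq)?
  grep eth0 /proc/interrupts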


> 
> > Oh no, I messed up something =\
> > 
> > Could you please also test not the whole series, but patches 1-3 (up to
> > "bpf: cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> > array...")? Would be great to see whether this implementation works
> > worse right from the start or I just broke something later on.
> > 
> > > and double checked and it seems the numbers are right. Here's
> > > some output from the profiles I took with:
> > > 
> > >      perf record -e cycles:k -a -- sleep 10
> > >      perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> > > 
> > >      # Event 'cycles:k'
> > >      # Baseline  Delta Abs  Shared Object                                                    Symbol
> > >           6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
> > 
> 
> I really appreciate that you provide perf data and perf diff, but as
> described above, we need data and information on what CPU cores are
> running which workload.
> 
> Fortunately perf diff (and perf report) support sorting like this:
>  perf diff --sort=cpu,symbol
> 
> But then you also need to control the CPUs used in the experiment for
> the diff to work.
> 
> I hope I made sense, as these kinds of CPU scaling benchmarks are tricky.

Indeed, sounds quite tricky.
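
Concretely, I guess a more controlled comparison would look something like
this (CPU numbers are placeholders; the point is to pin first and then record
only the cores involved so the per-CPU diff lines up):

  # after pinning, record only the CPUs involved, for baseline and patched runs
  perf record -e cycles:k -a -C 2,4 -- sleep 10
  mv perf.data perf.data.baseline
  perf record -e cycles:k -a -C 2,4 -- sleep 10
  mv perf.data perf.data.withpatches

  # diff per CPU so a regression can be attributed to a specific core/task
  perf --no-pager diff --sort=cpu,symbol perf.data.baseline perf.data.withpatches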

My understanding is that GRO is a powerful general-purpose optimization,
enough that it should rise above the usual noise on a reasonably
configured system (which mine is).

Maybe we can consider decoupling the cpumap GRO enablement from the
later optimizations?

So in Olek's above series, patches 1-3 seem like they would still
benefit from a simpler testbed. But the more targeted optimizations in
patch 4+ would probably justify a de-noised setup, possibly a single
host with xdp-trafficgen or something.

Procedurally speaking, maybe it would avoid some wasted effort if
everyone agreed on the general approach before investing more time into
the finer optimizations built on top of the basic GRO support?

Thanks,
Daniel
