Message-ID: <772affea-8d44-43ab-81e6-febaf0548da1@uwaterloo.ca>
Date: Tue, 4 Feb 2025 20:32:41 -0500
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>, Jakub Kicinski
<kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
almasrymina@...gle.com
Cc: netdev@...r.kernel.org, Joe Damato <jdamato@...tly.com>
Subject: Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll

On 2025-02-04 19:10, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.
[snip]
> Setup:
>
> - Running on Google C3 VMs with the idpf driver and the following configuration.
> - IRQ affinity and coalescing are common to both experiments.
> - There is only 1 RX/TX queue configured.
> - First experiment enables busy poll using sysctl for both epoll and
> socket APIs.
> - Second experiment enables NAPI threaded busy poll for the full device
> using sysctl.
>
> Non threaded NAPI busy poll enabled using sysctl.
> ```
> echo 400 | sudo tee /proc/sys/net/core/busy_poll
> echo 400 | sudo tee /proc/sys/net/core/busy_read
> echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> ```
>
> Results using the following command:
> ```
> sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> -p 50,90,99,999 -H <IP> -l 10
>
> ...
> ...
>
> num_transactions=2835
> latency_min=0.000018976
> latency_max=0.049642100
> latency_mean=0.003243618
> latency_stddev=0.010636847
> latency_p50=0.000025270
> latency_p90=0.005406710
> latency_p99=0.049807350
> latency_p99.9=0.049807350
> ```
>
> Results with NAPI threaded busy poll using the following command:
> ```
> sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> -p 50,90,99,999 -H <IP> -l 10
>
> ...
> ...
>
> num_transactions=460163
> latency_min=0.000015707
> latency_max=0.200182942
> latency_mean=0.000019453
> latency_stddev=0.000720727
> latency_p50=0.000016950
> latency_p90=0.000017270
> latency_p99=0.000018710
> latency_p99.9=0.000020150
> ```
>
> Here, with NAPI threaded busy poll on a separate core, we are able to
> consistently poll the NAPI to keep latency to an absolute minimum. We
> are also able to do this without any major changes to the onload stack
> and threading model.

As far as I'm concerned, this is still not sufficient information to
fully assess the experiment. The experiment shows a 162-fold decrease
in latency and a corresponding increase in throughput for this
closed-loop workload (which, btw, is different from your open-loop
fixed-rate use case). This would be an extraordinary improvement and
that alone warrants some scrutiny. A 162X difference means the base
case either has a lot of idle time or wastes an enormous amount of CPU
cycles. How can that be explained? It would be good to get some
instruction/cycle counts to drill down further.
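
For example, something along these lines during a run would give
comparable cycle/instruction counts for both setups (the core range is
taken from the taskset invocation above; the 10s window matches the
-l 10 run length):

```
# Count cycles/instructions on the cores used by the benchmark for 10s
sudo perf stat -C 3-10 -e cycles,instructions,task-clock -- sleep 10
```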

The server process invocation and the actual IRQ routing are not
provided. Just stating that they are common to both experiments is not
sufficient. Without further information, I still cannot rule out that:
- In the base case, application and napi processing execute on the same
core and trample on each other. I don't know how onload implements
epoll_wait, but I worry that it cannot align application processing
(busy-looping?) and napi processing (also busy-looping?).
- In the threaded busy-loop case, napi processing ends up on one core,
while the application executes on another one. This uses two cores
instead of one.
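
A quick way to double-check the actual placement during a run, assuming
the threaded napi kthreads show up with "napi" in their names and the
application threads as tcp_rr:

```
# Show which core each napi kthread and each tcp_rr thread last ran on
ps -eLo pid,tid,psr,comm | grep -E 'napi|tcp_rr'
# Per-core utilization while the benchmark is running
mpstat -P ALL 1
```
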
Based on the above, I think at least the following additional scenarios
need to be investigated:
a) Run the server application in proper fullbusy mode, i.e., cleanly
alternating between application processing and napi processing. As a
second step, spread the incoming traffic across two cores to compare
apples to apples.
b) Run application and napi processing on separate cores, but simply by
way of thread pinning and interrupt routing (see the sketch after this
list). How close does that get to the current results? Then selectively
add threaded napi and then busy polling.
c) Run the whole thing without onload for comparison. The benefits
should show without onload as well and it is easier to reproduce. Also,
I suspect onload hurts in the base case, which would explain its
atrociously high latency and low throughput.
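
For b), a minimal sketch of what I have in mind, assuming eth0, a
single RX/TX interrupt, and cores 3 and 4 as placeholders:

```
# Route the NIC interrupt (and hence napi processing) to core 3
IRQ=$(awk -F: '/eth0/ {gsub(/ /,"",$1); print $1; exit}' /proc/interrupts)
echo 3 | sudo tee /proc/irq/$IRQ/smp_affinity_list
# Pin the application to core 4 (client invocation quoted above as a
# placeholder; the server side would be pinned the same way)
sudo taskset -c 4 ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
  -p 50,90,99,999 -H <IP> -l 10
```
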
Or, even better, simply provide a complete specification / script for
the experiment that makes it easy to reproduce.
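
Even a short script like the following, reconstructed from the commands
quoted above, would go a long way (device name, cores and <IP> still
need to be filled in, and the server-side invocation plus IRQ routing
would have to be added):

```
#!/bin/bash
# Base-case configuration, as quoted above
echo 400   | sudo tee /proc/sys/net/core/busy_poll
echo 400   | sudo tee /proc/sys/net/core/busy_read
echo 2     | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
# Client invocation, as quoted above
sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
  --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
  -p 50,90,99,999 -H <IP> -l 10
```
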
Note that I don't dismiss the approach out of hand. I just think it's
important to properly understand the purported performance improvements.
At the same time, I don't think it's good for the planet to burn cores
with busy-looping without good reason.
Thanks,
Martin