Message-ID: <772affea-8d44-43ab-81e6-febaf0548da1@uwaterloo.ca>
Date: Tue, 4 Feb 2025 20:32:41 -0500
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>, Jakub Kicinski
<kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
almasrymina@...gle.com
Cc: netdev@...r.kernel.org, Joe Damato <jdamato@...tly.com>
Subject: Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll

On 2025-02-04 19:10, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.
[snip]
> Setup:
>
> - Running on Google C3 VMs with the idpf driver and the following configuration.
> - IRQ affinity and coalescing are common to both experiments.
> - There is only 1 RX/TX queue configured.
> - First experiment enables busy poll using sysctl for both epoll and
> socket APIs.
> - Second experiment enables NAPI threaded busy poll for the full device
> using sysctl.
>
> Non threaded NAPI busy poll enabled using sysctl.
> ```
> echo 400 | sudo tee /proc/sys/net/core/busy_poll
> echo 400 | sudo tee /proc/sys/net/core/busy_read
> echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> ```
>
> Results using the following command:
> ```
> sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> -p 50,90,99,999 -H <IP> -l 10
>
> ...
> ...
>
> num_transactions=2835
> latency_min=0.000018976
> latency_max=0.049642100
> latency_mean=0.003243618
> latency_stddev=0.010636847
> latency_p50=0.000025270
> latency_p90=0.005406710
> latency_p99=0.049807350
> latency_p99.9=0.049807350
> ```
>
> Results with NAPI threaded busy poll using the following command:
> ```
> sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> -p 50,90,99,999 -H <IP> -l 10
>
> ...
> ...
>
> num_transactions=460163
> latency_min=0.000015707
> latency_max=0.200182942
> latency_mean=0.000019453
> latency_stddev=0.000720727
> latency_p50=0.000016950
> latency_p90=0.000017270
> latency_p99=0.000018710
> latency_p99.9=0.000020150
> ```
>
> Here, with NAPI threaded busy poll on a separate core, we are able to
> consistently poll the NAPI to keep latency to an absolute minimum. We
> are also able to do this without any major changes to the onload stack
> and threading model.

As far as I'm concerned, this is still not sufficient information to
fully assess the experiment. The experiment shows a 162-fold decrease
in latency and a corresponding increase in throughput for this
closed-loop workload (which, btw, is different from your open-loop
fixed-rate use case). This would be an extraordinary improvement and
that alone warrants some scrutiny. A 162X difference means the base
case either has a lot of idle time or wastes an enormous amount of CPU
cycles. How can that be explained? It would be good to get some
instruction/cycle counts to drill down further.
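
For example, something along these lines during a run would give
comparable cycle/instruction counts for both setups (the core range is
taken from the taskset invocation above; the 10s window matches the
-l 10 run length):

```
# Count cycles/instructions on the cores used by the benchmark for 10s
sudo perf stat -C 3-10 -e cycles,instructions,task-clock -- sleep 10
```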

The server process invocation and the actual IRQ routing are not
provided. Just stating that they are common to both experiments is not
sufficient. Without further information, I still cannot rule out that:
- In the base case, application and napi processing execute on the same
core and trample on each other. I don't know how onload implements
epoll_wait, but I worry that it cannot align application processing
(busy-looping?) and napi processing (also busy-looping?).
- In the threaded busy-loop case, napi processing ends up on one core,
while the application executes on another one. This uses two cores
instead of one.
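
A quick way to double-check the actual placement during a run, assuming
the threaded napi kthreads show up with "napi" in their names and the
application threads as tcp_rr:

```
# Show which core each napi kthread and each tcp_rr thread last ran on
ps -eLo pid,tid,psr,comm | grep -E 'napi|tcp_rr'
# Per-core utilization while the benchmark is running
mpstat -P ALL 1
```
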
Based on the above, I think at least the following additional scenarios
need to be investigated:
a) Run the server application in proper fullbusy mode, i.e., cleanly
alternating between application processing and napi processing. As a
second step, spread the incoming traffic across two cores to compare
apples to apples.
b) Run application and napi processing on separate cores, but simply by
way of thread pinning and interrupt routing (see the sketch after this
list). How close does that get to the current results? Then selectively
add threaded napi and then busy polling.
c) Run the whole thing without onload for comparison. The benefits
should show without onload as well and it is easier to reproduce. Also,
I suspect onload hurts in the base case, which would explain its
atrociously high latency and low throughput.
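
For b), a minimal sketch of what I have in mind, assuming eth0, a
single RX/TX interrupt, and cores 3 and 4 as placeholders:

```
# Route the NIC interrupt (and hence napi processing) to core 3
IRQ=$(awk -F: '/eth0/ {gsub(/ /,"",$1); print $1; exit}' /proc/interrupts)
echo 3 | sudo tee /proc/irq/$IRQ/smp_affinity_list
# Pin the application to core 4 (client invocation quoted above as a
# placeholder; the server side would be pinned the same way)
sudo taskset -c 4 ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
  -p 50,90,99,999 -H <IP> -l 10
```
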
Or, even better, simply provide a complete specification / script for
the experiment that makes it easy to reproduce.
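
Even a short script like the following, reconstructed from the commands
quoted above, would go a long way (device name, cores and <IP> still
need to be filled in, and the server-side invocation plus IRQ routing
would have to be added):

```
#!/bin/bash
# Base-case configuration, as quoted above
echo 400   | sudo tee /proc/sys/net/core/busy_poll
echo 400   | sudo tee /proc/sys/net/core/busy_read
echo 2     | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
# Client invocation, as quoted above
sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
  --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
  -p 50,90,99,999 -H <IP> -l 10
```
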
Note that I don't dismiss the approach out of hand. I just think it's
important to properly understand the purported performance improvements.
At the same time, I don't think it's good for the planet to burn cores
with busy-looping without good reason.
Thanks,
Martin