netdev - Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll

Message-ID: <CAAywjhQM4BLXX55Kh0XQ_NqYv8sJVWBfPfSZMb7724_3DrsjjA@mail.gmail.com>
Date: Wed, 5 Feb 2025 12:35:00 -0800
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Martin Karsten <mkarsten@...terloo.ca>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, 
	netdev@...r.kernel.org, Joe Damato <jdamato@...tly.com>
Subject: Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll

On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>
> On 2025-02-04 19:10, Samiullah Khawaja wrote:
> > Extend the already existing support of threaded napi poll to do continuous
> > busy polling.
>
> [snip]
>
> > Setup:
> >
> > - Running on Google C3 VMs with idpf driver with following configurations.
> > - IRQ affinity and coalescing are common for both experiments.
> > - There is only 1 RX/TX queue configured.
> > - First experiment enables busy poll using sysctl for both epoll and
> >    socket APIs.
> > - Second experiment enables NAPI threaded busy poll for the full device
> >    using sysctl.
> >
> > Non threaded NAPI busy poll enabled using sysctl.
> > ```
> > echo 400 | sudo tee /proc/sys/net/core/busy_poll
> > echo 400 | sudo tee /proc/sys/net/core/busy_read
> > echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > echo 15000  | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > ```
> >
> > Results using following command,
> > ```
> > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> >               --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> >               -p 50,90,99,999 -H <IP> -l 10
> >
> > ...
> > ...
> >
> > num_transactions=2835
> > latency_min=0.000018976
> > latency_max=0.049642100
> > latency_mean=0.003243618
> > latency_stddev=0.010636847
> > latency_p50=0.000025270
> > latency_p90=0.005406710
> > latency_p99=0.049807350
> > latency_p99.9=0.049807350
> > ```
> >
> > Results with napi threaded busy poll using following command,
> > ```
> > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> >                  --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> >                  -p 50,90,99,999 -H <IP> -l 10
> >
> > ...
> > ...
> >
> > num_transactions=460163
> > latency_min=0.000015707
> > latency_max=0.200182942
> > latency_mean=0.000019453
> > latency_stddev=0.000720727
> > latency_p50=0.000016950
> > latency_p90=0.000017270
> > latency_p99=0.000018710
> > latency_p99.9=0.000020150
> > ```
> >
> > Here with NAPI threaded busy poll in a separate core, we are able to
> > consistently poll the NAPI to keep latency to absolute minimum. And also
> > we are able to do this without any major changes to the onload stack and
> > threading model.
>
> As far as I'm concerned, this is still not sufficient information to
> fully assess the experiment. The experiment shows a 162-fold decrease
> in latency and a corresponding increase in throughput for this
> closed-loop workload (which, btw, is different from your open-loop fixed
> rate use case). This would be an extraordinary improvement and that
> alone warrants some scrutiny. 162X means either the base case has a lot
> of idle time or wastes an enormous amount of cpu cycles. How can that be
> explained? It would be good to get some instruction/cycle counts to
> drill down further.
The difference is much more apparent (and larger) when I use more
sockets (50 in this case). I have noticed that the situation gets
worse as I add more sockets to the mix, but I think this is enough to
show the effect. Processing packets on a core and then going back to
userspace to do application work (or protocol processing, in the case
of onload) does not work for this use case. If you look at P50, most
of the time there is not much difference, but the tail latencies add
up in the P90 case. I want the descriptors to be pulled from the NIC
queues and handed over right away for processing on a separate core.
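
As a rough cross-check, the ratios between the two runs can be
recomputed from the numbers quoted above. This is only a sketch: the
dictionary and function names are illustrative, the values are copied
from the two tcp_rr outputs, and the latencies are assumed to be in
seconds.

```python
# Values copied from the two tcp_rr outputs quoted earlier in this thread.
# Assumption: latencies are expressed in seconds.
base = {"num_transactions": 2835, "p50": 0.000025270,
        "p90": 0.005406710, "p99": 0.049807350}
threaded = {"num_transactions": 460163, "p50": 0.000016950,
            "p90": 0.000017270, "p99": 0.000018710}


def ratios(base, threaded):
    """Return the throughput gain and per-percentile latency reduction."""
    out = {"throughput": threaded["num_transactions"] / base["num_transactions"]}
    for key in ("p50", "p90", "p99"):
        out[key] = base[key] / threaded[key]
    return out


for name, value in ratios(base, threaded).items():
    print(f"{name}: {value:.1f}x")
```

The throughput ratio comes out at roughly 162x, while P50 differs by
only about 1.5x and P90 by about 313x, which is the P50 versus tail
split described above.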
>
> The server process invocation and the actual irq routing is not
> provided. Just stating it's common for both experiments is not
> sufficient. Without further information, I still cannot rule out that:
>
> - In the base case, application and napi processing execute on the same
> core and trample on each other. I don't know how onload implements
> epoll_wait, but I worry that it cannot align application processing
> (busy-looping?) and napi processing (also busy-looping?).
>
> - In the threaded busy-loop case, napi processing ends up on one core,
> while the application executes on another one. This uses two cores
> instead of one.
>
> Based on the above, I think at least the following additional scenarios
> need to be investigated:
>
> a) Run the server application in proper fullbusy mode, i.e., cleanly
> alternating between application processing and napi processing. As a
> second step, spread the incoming traffic across two cores to compare
> apples to apples.
This is exactly what is being done in the experiment I posted, and it
shows massive degradation of latency when the core is shared between
application processing and napi processing. The busy_read setup that I
mentioned above makes onload do napi processing when xsk_recvmsg is
called. Also, onload spins in userspace to handle the AF_XDP
queues/rings in memory.
>
> b) Run application and napi processing on separate cores, but simply by
> way of thread pinning and interrupt routing. How close does that get to
> the current results? Then selectively add threaded napi and then busy
> polling.
This was the base case from which we started looking into this work.
It gives much worse latency because the packets are only picked from
the RX queue on interrupt wakeups (and BH). In fact, moving them to
separate cores in this case lets the core that receives the interrupts
go idle and sleep when the packet rate is low.
>
> c) Run the whole thing without onload for comparison. The benefits
> should show without onload as well and it's easier to reproduce. Also, I
> suspect onload hurts in the base case and that explains the atrociously
> high latency and low throughput of it.
>
> Or, even better, simply provide a complete specification / script for
> the experiment that makes it easy to reproduce.
That would require setting up onload on the platform you use, provided
it has all the AF_XDP things needed to bring it up. I think I have
provided everything that you would need to set this up on your
platform. I have provided the onload repo; it is open source and it
has a README with steps to set it up. I have provided the sysctl
configuration I am using. I have also provided the exact command with
all the arguments I am using to run onload with neper (configuration
and environment, including cpu affinity setup).
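
For anyone trying to reproduce this, a small helper along these lines
could record the settings in effect next to each run. It is a sketch
only: the paths are the ones from the sysctl commands quoted above,
"eth0" is the interface name used there and may differ elsewhere, and
the `snapshot` helper and its `root` parameter are hypothetical names.

```python
from pathlib import Path

# Paths taken from the sysctl commands quoted earlier in this thread.
# "eth0" is the interface name used there; adjust for other machines.
SETTINGS = [
    "/proc/sys/net/core/busy_poll",
    "/proc/sys/net/core/busy_read",
    "/sys/class/net/eth0/napi_defer_hard_irqs",
    "/sys/class/net/eth0/gro_flush_timeout",
]


def snapshot(paths=SETTINGS, root="/"):
    """Read each setting; missing or unreadable files map to None."""
    result = {}
    for p in paths:
        try:
            result[p] = (Path(root) / p.lstrip("/")).read_text().strip()
        except OSError:
            result[p] = None
    return result


if __name__ == "__main__":
    for path, value in snapshot().items():
        print(f"{path} = {value if value is not None else '<unavailable>'}")
```

The `root` parameter exists only so the helper can be pointed at a
copy of the files; on a live system the default reads /proc and /sys
directly.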
>
> Note that I don't dismiss the approach out of hand. I just think it's
> important to properly understand the purported performance improvements.
I think the performance improvements are apparent from the data I
provided. I purposefully used more sockets in this revision to show
the real differences in tail latency.

Also, one thing that you are probably missing is that this change
also has an API aspect: it allows a user to drive napi independently
of the user API or protocol being used. I can certainly drive the napi
using recvmsg, but the napi will only be driven if there is no data in
the recv queue. recvmsg checks the recv_queue and returns if it is not
empty. This forces the application to drain the socket before any napi
processing happens, basically introducing the same effect of
alternating between napi processing and application processing. The
use case of driving the napi on a separate core (or with a couple of
threads sharing a single core) is handled cleanly with this change by
enabling it through netlink.
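
The recvmsg behaviour described above can be illustrated with a toy
model. This is not kernel code: `ToySocket`, `rx_ring` and
`napi_polls` are hypothetical names, and the real path involves
budgets and timeouts that are left out here. It only shows that the
napi is not polled while the socket receive queue still holds data.

```python
from collections import deque


class ToySocket:
    """Toy model, not kernel code: an RX ring feeding a socket receive queue."""

    def __init__(self):
        self.rx_ring = deque()     # packets still sitting in the NIC queue
        self.recv_queue = deque()  # packets already delivered to the socket
        self.napi_polls = 0        # how many times the napi was driven

    def napi_poll(self):
        """Move everything from the RX ring to the socket receive queue."""
        self.napi_polls += 1
        while self.rx_ring:
            self.recv_queue.append(self.rx_ring.popleft())

    def recvmsg(self):
        """Return queued data if any; drive the napi only when the queue is empty."""
        if not self.recv_queue:
            self.napi_poll()
        return self.recv_queue.popleft() if self.recv_queue else None


sock = ToySocket()
sock.recv_queue.extend(["a", "b"])  # already delivered to the socket
sock.rx_ring.append("c")            # arrived at the NIC, not yet polled

polls_seen = []
for _ in range(3):
    sock.recvmsg()
    polls_seen.append(sock.napi_polls)
print(polls_seen)
```

In this model the packet waiting in the RX ring is only picked up on
the third call, after the two queued messages have been drained.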
> At the same time, I don't think it's good for the planet to burn cores
> with busy-looping without good reason.
>
> Thanks,
> Martin
>