Message-ID: <CAAywjhQwNJuHE6T6caq9Y6DfDqrZo6CTP5ToSDHrcE4wZH_7YQ@mail.gmail.com>
Date: Wed, 15 Jan 2025 14:35:07 -0800
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Martin Karsten <mkarsten@...terloo.ca>
Cc: Joe Damato <jdamato@...tly.com>, Jakub Kicinski <kuba@...nel.org>,
"David S . Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll
On Wed, Jan 8, 2025 at 1:54 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>
> On 2025-01-08 16:18, Samiullah Khawaja wrote:
> > On Wed, Jan 8, 2025 at 11:25 AM Joe Damato <jdamato@...tly.com> wrote:
> >>
> >> On Thu, Jan 02, 2025 at 04:47:14PM -0800, Jakub Kicinski wrote:
> >>> On Thu, 2 Jan 2025 19:12:24 +0000 Samiullah Khawaja wrote:
> >>>> Extend the already existing support of threaded napi poll to do continuous
> >>>> busypolling.
> >>>>
> >>>> This is used for doing continuous polling of napi to fetch descriptors from
> >>>> backing RX/TX queues for low latency applications. Allow enabling of threaded
> >>>> busypoll using netlink so this can be enabled on a set of dedicated napis for
> >>>> low latency applications.
> >>>
> >>> This is lacking clear justification and experimental results
> >>> vs doing the same thing from user space.
> > Thanks for the response.
> >
> > The major benefit is that this is one common way to enable busy
> > polling of descriptors on a particular napi. It is basically
> > independent of the userspace API and allows for enabling busy polling
> > on a subset, out of the complete list, of napi instances in a device
> > that can be shared among multiple processes/applications that have low
> > latency requirements. This allows for a dedicated subset of napi
> > instances that are configured for busy polling on a machine and
> > workload/jobs can target these napi instances.
> >
> > Once enabled, the relevant kthread can be queried using netlink
> > `get-napi` op. The thread priority, scheduler and any thread core
> > affinity can also be set. Any userspace application using a variety of
> > interfaces (AF_XDP, io_uring, xsk, epoll etc) can run on top of it
> > without any further complexity. For userspace driven napi busy
> > polling, one has to either use sysctls to setup busypolling that are
> > done at device level or use different interfaces depending on the use
> > cases,
> > - epoll params (or a sysctl that is system wide) for epoll based interface
> > - socket option (or sysctl that is system wide) for sk_recvmsg
> > - io_uring (I believe SQPOLL can be configured with it)
> >
> > Our application for this feature uses a userspace implementation of
> > TCP (https://github.com/Xilinx-CNS/onload) that interfaces with AF_XDP
> > to send/receive packets. We use neper (running with AF_XDP + userspace
> > TCP library) to measure latency improvements with and without napi
> > threaded busy poll. Our target application sends packets with a well
> > defined frequency with a couple of 100 bytes of RPC style
> > request/response.
>
> Let me also apologize for being late to the party. I am not always
> paying close attention to the list and did not see this until Joe
> flagged it for me.
Thanks for the reply.
>
> I have a couple of questions about your experiments. In general, I find
> this level of experiment description not sufficient for reproducibility.
> Ideally you point to complete scripts.
>
> A canonical problem with using network benchmarks like neper to assess
> network stack processing is that it typically inflates the relative
> importance of network stack processing, because there is no application
> processing involved.
Agreed on your assessment. I went back to get some more info before
replying. Our use case is very low latency: RPCs at a solid 14us with
very small messages of around 200 bytes and minimal application
processing. We have well defined traffic patterns for this use case,
with a defined maximum number of packets per second to make sure the
latency is guaranteed. To measure the performance of such a use case,
we picked neper and used it to generate our traffic pattern. So while
we are using neper, this does represent our real world use case. Also,
in my experiments I am using neper with the onload library that I
mentioned earlier to interface with the NIC using AF_XDP. In short, we
do want the maximum network stack optimization, where packets are
pulled off the descriptor queue quickly.
>
> Were the experiments below run single-threaded?
Since we are waiting on some other features in our environment, we
are working with only 1 RX queue that has multiple flows running. Both
experiments have the same interrupt configuration. Also, the userspace
process has its affinity set to be close to the core receiving the
interrupts.
>
> > Test Environment:
> > Google C3 VMs running netdev-net/main kernel with idpf driver
> >
> > Without napi threaded busy poll (p50 at around 44us)
> > num_transactions=47918
> > latency_min=0.000018838
> > latency_max=0.333912365
> > latency_mean=0.000189570
> > latency_stddev=0.005859874
> > latency_p50=0.000043510
> > latency_p90=0.000053750
> > latency_p99=0.000058230
> > latency_p99.9=0.000184310
>
> What was the interrupt routing in the above base case?
>
> > With napi threaded busy poll (p50 around 14us)
> > latency_min=0.000012271
> > latency_max=0.209365389
> > latency_mean=0.000021611
> > latency_stddev=0.001166541
> > latency_p50=0.000013590
> > latency_p90=0.000019990
> > latency_p99=0.000023670
> > latency_p99.9=0.000027830
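To make the comparison between the two runs above easier to read, here
is a rough Python sketch that parses the percentile lines and prints
the ratio for each one. It is only an illustration and not part of the
patch series; the helper names (parse_latencies, ratios) are made up,
and the input is just the percentile figures quoted above, in seconds.

```python
def parse_latencies(text):
    """Parse 'latency_*=<seconds>' lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.startswith("latency_"):
            stats[key] = float(value)
    return stats


def ratios(base, new):
    """Return base/new for every key present in both runs."""
    return {k: base[k] / new[k] for k in base if k in new}


# Percentile figures copied from the two runs quoted above (seconds).
BASELINE = """
latency_p50=0.000043510
latency_p90=0.000053750
latency_p99=0.000058230
latency_p99.9=0.000184310
"""

THREADED_BUSY_POLL = """
latency_p50=0.000013590
latency_p90=0.000019990
latency_p99=0.000023670
latency_p99.9=0.000027830
"""

if __name__ == "__main__":
    base = parse_latencies(BASELINE)
    new = parse_latencies(THREADED_BUSY_POLL)
    for key, ratio in ratios(base, new).items():
        print(f"{key}: {base[key] * 1e6:.2f}us -> "
              f"{new[key] * 1e6:.2f}us ({ratio:.1f}x)")
```

On these numbers p50 improves by roughly 3.2x and p99.9 by roughly
6.6x.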
>
> How many cores are in play in this case?
Same in userspace. But napi has its own dedicated core polling it
inside the kernel. Since napi is polled continuously, we don't enable
interrupts in this case, as implemented in the patch. This is one of
the major reasons we cannot drive this from userspace and want napi
driven on a separate core, independent of the application processing
logic. We don't want latency to degrade while the thread that is
driving the napi goes back to userspace and handles application logic
or packet processing that might be happening in onload.
>
> I am wondering whether your baseline effectively uses only one core, but
> your "threaded busy poll" case uses two? Then I am wondering whether a
> similar effect could be achieved by suitable interrupt and thread affinity?
We tried this in earlier experiments by setting up interrupt and
thread affinity to bring them closer together, and the 44us latency
was achieved that way. In non-AF_XDP tests, with busy polling enabled
at the socket level using a socket option, we are only able to achieve
around 20us, and P99 still suffers. This is mostly because the thread
that is driving the napi goes back to userspace to do application
work. This netlink based mechanism solves that and provides a
UAPI-independent mechanism to enable busy polling for a napi. One can
choose to configure the napi thread core affinity and priority to
share cores with userspace processes if desired.
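
As a minimal sketch of the affinity part, assuming the PID of the napi
kthread has already been obtained (for example via the netlink query
mentioned earlier), something like the following could pin it to a
chosen set of cores. The helper name pin_to_cpus is hypothetical; the
demo below uses PID 0 (the calling process) as a stand-in, since
changing a kernel thread's affinity would normally need privileges.

```python
import os


def pin_to_cpus(pid, cpus):
    """Restrict `pid` to `cpus`; return the affinity the kernel reports."""
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Stand-in for the napi kthread PID (assumption): 0 means "this process".
    napi_thread_pid = 0
    # Keep whatever CPUs are currently allowed; a real setup would pick
    # the dedicated core(s) reserved for polling instead.
    target = os.sched_getaffinity(napi_thread_pid)
    print(sorted(pin_to_cpus(napi_thread_pid, target)))
```

Scheduler policy and priority could be set in a similar way, but that
is left out here to keep the example unprivileged.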
>
> Thanks,
> Martin
>
> [snip]
>