Message-ID: <5971d10c-c8a3-43e7-88e3-674808ae39a3@uwaterloo.ca>
Date: Wed, 8 Jan 2025 16:53:56 -0500
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>, Joe Damato <jdamato@...tly.com>,
Jakub Kicinski <kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
netdev@...r.kernel.org
Subject: Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll
On 2025-01-08 16:18, Samiullah Khawaja wrote:
> On Wed, Jan 8, 2025 at 11:25 AM Joe Damato <jdamato@...tly.com> wrote:
>>
>> On Thu, Jan 02, 2025 at 04:47:14PM -0800, Jakub Kicinski wrote:
>>> On Thu, 2 Jan 2025 19:12:24 +0000 Samiullah Khawaja wrote:
>>>> Extend the existing threaded napi poll support to do continuous
>>>> busy polling.
>>>>
>>>> This is used for continuous polling of napi to fetch descriptors
>>>> from the backing RX/TX queues for low-latency applications. Allow
>>>> enabling threaded busy poll via netlink so it can be enabled on a
>>>> set of dedicated napis for low-latency applications.
>>>
>>> This is lacking clear justification and experimental results
>>> vs doing the same thing from user space.
> Thanks for the response.
>
> The major benefit is that this provides one common way to enable busy
> polling of descriptors on a particular napi. It is independent of the
> userspace API and allows busy polling to be enabled on a subset of a
> device's napi instances, which can then be shared among multiple
> processes/applications with low-latency requirements. A machine can
> thus have a dedicated subset of napi instances configured for busy
> polling, and workloads/jobs can target those napi instances.
>
> Once enabled, the relevant kthread can be queried using the netlink
> `get-napi` op. The thread priority, scheduler and any thread core
> affinity can also be set. Any userspace application, using any of a
> variety of interfaces (AF_XDP, io_uring, xsk, epoll, etc.), can run on
> top of it without further complexity. For userspace-driven napi busy
> polling, one has to either use sysctls to set up busy polling at the
> device level or use different interfaces depending on the use case:
> - epoll params (or a system-wide sysctl) for the epoll-based interface
> - a socket option (or a system-wide sysctl) for sk_recvmsg
> - io_uring (I believe SQPOLL can be configured with it)
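
For concreteness, the per-socket variant mentioned above amounts to a
few setsockopt() calls along these lines. This is only a rough sketch:
the numeric values are illustrative, and the fallback defines mirror
include/uapi/asm-generic/socket.h for older libc headers.

/* Per-socket busy-poll knobs (sketch; values illustrative). The
 * system-wide equivalents are the net.core.busy_read and
 * net.core.busy_poll sysctls.
 */
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL		46
#endif
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL	69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET	70
#endif

static int enable_socket_busy_poll(int fd)
{
	unsigned int usecs = 64;	/* busy poll for up to 64us per call */
	int prefer = 1;			/* prefer busy polling over IRQs */
	unsigned int budget = 64;	/* max packets per busy-poll round */

	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)))
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &prefer,
		       sizeof(prefer)))
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget,
			  sizeof(budget));
}
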
>
> Our application for this feature uses a userspace implementation of
> TCP (https://github.com/Xilinx-CNS/onload) that interfaces with AF_XDP
> to send/receive packets. We use neper (running with AF_XDP + the
> userspace TCP library) to measure latency improvements with and
> without napi threaded busy poll. Our target application sends packets
> at a well-defined frequency, with RPC-style requests/responses of a
> couple hundred bytes each.
Let me also apologize for being late to the party. I am not always
paying close attention to the list and did not see this until Joe
flagged it for me.
I have a couple of questions about your experiments. In general, I find
this level of experiment description insufficient for reproducibility.
Ideally, you would point to complete scripts.
A canonical problem with using network benchmarks like neper to assess
network stack processing is that it typically inflates the relative
importance of network stack processing, because no application
processing is involved.
Were the experiments below run single-threaded?
> Test Environment:
> Google C3 VMs running netdev-net/main kernel with idpf driver
>
> Without napi threaded busy poll (p50 at around 44us)
> num_transactions=47918
> latency_min=0.000018838
> latency_max=0.333912365
> latency_mean=0.000189570
> latency_stddev=0.005859874
> latency_p50=0.000043510
> latency_p90=0.000053750
> latency_p99=0.000058230
> latency_p99.9=0.000184310
What was the interrupt routing in the above base case?
> With napi threaded busy poll (p50 around 14us)
> latency_min=0.000012271
> latency_max=0.209365389
> latency_mean=0.000021611
> latency_stddev=0.001166541
> latency_p50=0.000013590
> latency_p90=0.000019990
> latency_p99=0.000023670
> latency_p99.9=0.000027830
How many cores are in play in this case?
I am wondering whether your baseline effectively uses only one core,
while your "threaded busy poll" case uses two? If so, I am wondering
whether a similar effect could be achieved by suitable interrupt and
thread affinity?
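
To be concrete about what I mean by affinity: pin the application's
network thread to one core and steer the NIC queue's IRQ to another.
A minimal sketch follows; the IRQ number and CPU ids are made up, and a
real setup would look them up (e.g. via /proc/interrupts, or the irq
attribute reported by the netlink napi-get op).

/* Sketch: pin the calling thread to CPU 2, route IRQ 123 to CPU 3.
 * Writing /proc/irq/<n>/smp_affinity_list requires root.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set); /* 0: calling thread */
}

static int route_irq_to_cpu(int irq, int cpu)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", cpu);
	return fclose(f);
}

int main(void)
{
	pin_self_to_cpu(2);		/* network thread on CPU 2 */
	route_irq_to_cpu(123, 3);	/* NIC queue IRQ on CPU 3 */
	return 0;
}
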
Thanks,
Martin
[snip]