Message-ID: <8edf6313-a329-4431-a44e-d903d801c771@uwaterloo.ca>
Date: Wed, 15 Jan 2025 19:28:54 -0500
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: Joe Damato <jdamato@...tly.com>, Jakub Kicinski <kuba@...nel.org>,
"David S . Miller" <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
netdev@...r.kernel.org
Subject: Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll

On 2025-01-15 17:35, Samiullah Khawaja wrote:
> On Wed, Jan 8, 2025 at 1:54 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>
>> On 2025-01-08 16:18, Samiullah Khawaja wrote:
>>> On Wed, Jan 8, 2025 at 11:25 AM Joe Damato <jdamato@...tly.com> wrote:
>>>>
>>>> On Thu, Jan 02, 2025 at 04:47:14PM -0800, Jakub Kicinski wrote:
>>>>> On Thu, 2 Jan 2025 19:12:24 +0000 Samiullah Khawaja wrote:
>>>>>> Extend the already existing support for threaded napi poll to do
>>>>>> continuous busy polling.
>>>>>>
>>>>>> This is used to continuously poll a napi and fetch descriptors from
>>>>>> the backing RX/TX queues for low latency applications. Allow threaded
>>>>>> busy poll to be enabled via netlink, so it can be turned on for a set
>>>>>> of dedicated napis used by low latency applications.
>>>>>
>>>>> This is lacking clear justification and experimental results
>>>>> vs doing the same thing from user space.
>>> Thanks for the response.
>>>
>>> The major benefit is that this is one common way to enable busy
>>> polling of descriptors on a particular napi. It is basically
>>> independent of the userspace API and allows busy polling to be
>>> enabled on a subset of a device's napi instances, which can be
>>> shared among multiple processes/applications that have low latency
>>> requirements. This allows a dedicated subset of napi instances on a
>>> machine to be configured for busy polling, and workloads/jobs can
>>> target these napi instances.
>>>
>>> Once enabled, the relevant kthread can be queried using the netlink
>>> `get-napi` op. The thread priority, scheduler and core affinity can
>>> also be set. Any userspace application, using a variety of
>>> interfaces (AF_XDP, io_uring, xsk, epoll, etc.), can run on top of
>>> it without any further complexity. For userspace-driven napi busy
>>> polling, one has to either use sysctls to set up busy polling at the
>>> device level or use different interfaces depending on the use case:
>>> - epoll params (or a system-wide sysctl) for the epoll-based interface
>>> - a socket option (or a system-wide sysctl) for sk_recvmsg
>>> - io_uring (I believe SQPOLL can be configured with it)
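
Just to make sure I follow the first bullet: I take "epoll params" to
mean the per-epoll-context ioctl, roughly like the sketch below
(untested, values made up, and it assumes uapi headers new enough to
have EPIOCSPARAMS):

#include <sys/ioctl.h>
#include <linux/eventpoll.h>    /* EPIOCSPARAMS, struct epoll_params */

/* Rough sketch only: enable busy polling for one epoll context. */
static int epoll_enable_busy_poll(int epfd)
{
        struct epoll_params params = {
                .busy_poll_usecs  = 64, /* placeholder budget */
                .busy_poll_budget = 8,
                .prefer_busy_poll = 1,
        };

        return ioctl(epfd, EPIOCSPARAMS, &params);
}

As I understand your series, the goal is to avoid scattering such
per-application configuration and instead flip one per-napi knob via
netlink. Please correct me if that reading is off.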
>>>
>>> Our application for this feature uses a userspace implementation of
>>> TCP (https://github.com/Xilinx-CNS/onload) that interfaces with AF_XDP
>>> to send/receive packets. We use neper (running with AF_XDP + the
>>> userspace TCP library) to measure latency improvements with and
>>> without napi threaded busy poll. Our target application sends packets
>>> at a well-defined frequency with a couple hundred bytes of RPC-style
>>> request/response.
>>
>> Let me also apologize for being late to the party. I am not always
>> paying close attention to the list and did not see this until Joe
>> flagged it for me.
> Thanks for the reply.
>>
>> I have a couple of questions about your experiments. In general, I find
>> this level of experiment description insufficient for reproducibility.
>> Ideally, you would point to complete scripts.
>>
>> A canonical problem with using network benchmarks like neper to assess
>> network stack processing is that they typically inflate the relative
>> importance of network stack processing, because there is no application
>> processing involved.
> Agreed on your assessment, and I went back to get some more info before
> I could reply to this. Basically, our use case is very low latency:
> solid 14us RPCs with very small messages (around 200 bytes) and minimal
> application processing. We have well-defined traffic patterns for this
> use case, with a defined maximum number of packets per second to make
> sure the latency is guaranteed. So to measure the performance of such
> a use case, we basically picked up neper and used it to generate our
> traffic pattern. While we are using neper, this does represent our
> real-world use case. Also, in my experimentation, I am using neper with
> the onload library that I mentioned earlier to interface with the NIC
> using AF_XDP. In short, we do want maximum network stack optimization,
> where packets are pulled off the descriptor queue quickly.
>>
>> Were the experiments below run single-threaded?
> Since we are waiting on some of the other features in our environment,
> we are working with only 1 RX queue that has multiple flows running.
> Both experiments have the same interrupt configuration. Also, the
> userspace process has its affinity set close to the core receiving
> the interrupts.
>>
>>> Test Environment:
>>> Google C3 VMs running netdev-net/main kernel with idpf driver
>>>
>>> Without napi threaded busy poll (p50 at around 44us)
>>> num_transactions=47918
>>> latency_min=0.000018838
>>> latency_max=0.333912365
>>> latency_mean=0.000189570
>>> latency_stddev=0.005859874
>>> latency_p50=0.000043510
>>> latency_p90=0.000053750
>>> latency_p99=0.000058230
>>> latency_p99.9=0.000184310
>>
>> What was the interrupt routing in the above base case?
>>
>>> With napi threaded busy poll (p50 around 14us)
>>> latency_min=0.000012271
>>> latency_max=0.209365389
>>> latency_mean=0.000021611
>>> latency_stddev=0.001166541
>>> latency_p50=0.000013590
>>> latency_p90=0.000019990
>>> latency_p99=0.000023670
>>> latency_p99.9=0.000027830
>>
>> How many cores are in play in this case?
> Same in userspace. But the napi has its own dedicated core polling it
> inside the kernel. Since the napi is polled continuously, we don't
> enable interrupts in this case, as implemented in the patch. This is
> one of the major reasons we cannot drive this from userspace and want
> the napi driven on a separate core, independent of the application
> processing logic. We don't want latency to suffer while the thread
> that is driving the napi goes back to userspace and handles some
> application logic or packet processing that might be happening in
> onload.
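
Just so we are describing the same setup: I assume the dedicated core is
arranged by pinning the napi kthread and raising its priority with the
usual calls, roughly as below (the pid would come from the `get-napi`
reply you mention; core number and priority are placeholders I made up):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Rough sketch: pin the napi kthread to core 3 and run it SCHED_FIFO. */
static int dedicate_napi_kthread(pid_t napi_pid)
{
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = 50 };  /* placeholder */

        CPU_ZERO(&set);
        CPU_SET(3, &set);  /* placeholder core */
        if (sched_setaffinity(napi_pid, sizeof(set), &set))
                return -1;
        return sched_setscheduler(napi_pid, SCHED_FIFO, &sp);
}

If the setup differs from that, it would be good to see the exact steps.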
>>
>> I am wondering whether your baseline effectively uses only one core, but
>> your "threaded busy poll" case uses two? Then I am wondering whether a
>> similar effect could be achieved by suitable interrupt and thread affinity?
> We tried doing this in earlier experiments by setting up proper
> interrupt and thread affinity to bring them closer, and the 44us
> latency above was achieved that way. With non-AF_XDP tests, enabling
> busy polling at the socket level using a socket option, we are only
> able to achieve around 20us, and p99 still suffers. This is mostly
> because the thread that is driving the napi goes back to userspace to
> do application work. This netlink-based mechanism basically solves
> that and provides a UAPI-independent way to enable busy polling for a
> napi. One can choose to configure the napi thread's core affinity and
> priority to share cores with userspace processes if desired.
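
Just to make sure I interpret that 20us number correctly: I assume
"busy polling at the socket level" means the usual per-socket knobs,
along the lines of the rough sketch below (values are placeholders, not
your settings):

#include <sys/socket.h>

#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69  /* older libc headers may lack these */
#define SO_BUSY_POLL_BUDGET 70
#endif

/* Rough sketch: per-socket busy polling as I imagine the baseline. */
static void sock_enable_busy_poll(int fd)
{
        int usecs = 64, prefer = 1, budget = 8;  /* placeholder values */

        setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
        setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &prefer, sizeof(prefer));
        setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget, sizeof(budget));
}

If it was configured differently, that detail belongs with the scripts
I ask for below.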

Thanks for your explanations. I have a better sense now of your
motivation. However, I can think of several follow-up questions and
what-ifs about your experiments. Rather than going back and forth on
the list, I would find it extremely helpful to see your actual and
complete experiment setups, ideally as script(s), so that one can
reproduce your observations and tinker with variations and what-ifs.

But I'm just a bit of a tourist here, so you might get away with
ignoring me. :-)
Best,
Martin