Message-ID: <d2b52ee5-d7a7-4a97-ba9a-6c99e1470d9b@uwaterloo.ca>
Date: Mon, 25 Aug 2025 13:40:54 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, willemb@...gle.com,
 Joe Damato <joe@...a.to>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll

On 2025-08-25 13:20, Samiullah Khawaja wrote:
> On Sun, Aug 24, 2025 at 5:03 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>
>> On 2025-08-24 17:54, Samiullah Khawaja wrote:
>>> Extend the existing threaded napi poll support to do continuous
>>> busy polling.
>>>
>>> This is used to continuously poll a napi and fetch descriptors from the
>>> backing RX/TX queues for low latency applications. Allow enabling of
>>> threaded busy poll through netlink so it can be turned on for a set of
>>> dedicated napis.
>>>
>>> Once enabled, the user can fetch the PID of the kthread doing NAPI
>>> polling and set its affinity, priority and scheduling policy according
>>> to the low-latency requirements.
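>>>
>>> For illustration, a minimal sketch using plain sched_setaffinity() and
>>> sched_setscheduler() (the kthread PID, core and SCHED_FIFO priority are
>>> placeholders, and how the PID is obtained is up to the user, e.g. ps):
>>>
>>>     /* pin_napi.c - pin a napi kthread (e.g. "napi/eth0-65") to one
>>>      * core and give it a real-time scheduling policy. */
>>>     #define _GNU_SOURCE
>>>     #include <sched.h>
>>>     #include <stdio.h>
>>>     #include <stdlib.h>
>>>
>>>     int main(int argc, char **argv)
>>>     {
>>>         struct sched_param sp = { .sched_priority = 50 };
>>>         cpu_set_t set;
>>>         pid_t pid;
>>>
>>>         if (argc != 3) {
>>>             fprintf(stderr, "usage: %s <napi-kthread-pid> <cpu>\n", argv[0]);
>>>             return 1;
>>>         }
>>>         pid = (pid_t)atoi(argv[1]);
>>>
>>>         CPU_ZERO(&set);
>>>         CPU_SET(atoi(argv[2]), &set);
>>>         if (sched_setaffinity(pid, sizeof(set), &set))
>>>             perror("sched_setaffinity");
>>>         if (sched_setscheduler(pid, SCHED_FIFO, &sp))
>>>             perror("sched_setscheduler");
>>>         return 0;
>>>     }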
>>>
>>> Currently threaded napi can only be enabled at the device level using
>>> sysfs. Add support to enable/disable threaded mode for a napi
>>> individually through the netlink interface, by extending the `napi-set`
>>> op in the netlink spec to allow setting the `threaded` attribute of a
>>> napi.
>>>
>>> Extend the threaded attribute in the napi struct with an option to
>>> enable continuous busy polling, and extend the netlink and sysfs
>>> interfaces to allow enabling/disabling threaded busy polling at the
>>> device or individual napi level.
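>>>
>>> For the device-level path, a rough sketch of flipping the existing
>>> sysfs knob from userspace (the device name is a placeholder; today the
>>> attribute takes 0/1, and the value that requests busy polling is
>>> whatever this series ends up defining):
>>>
>>>     /* set_threaded.c - write /sys/class/net/<dev>/threaded */
>>>     #include <stdio.h>
>>>
>>>     static int set_threaded(const char *dev, const char *val)
>>>     {
>>>         char path[256];
>>>         FILE *f;
>>>
>>>         snprintf(path, sizeof(path), "/sys/class/net/%s/threaded", dev);
>>>         f = fopen(path, "w");
>>>         if (!f) {
>>>             perror(path);
>>>             return -1;
>>>         }
>>>         fputs(val, f);
>>>         return fclose(f) ? -1 : 0;
>>>     }
>>>
>>>     int main(void)
>>>     {
>>>         return set_threaded("eth0", "1") ? 1 : 0;
>>>     }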
>>>
>>> We use this for our AF_XDP based hard low-latency use case with
>>> microsecond-level latency requirements, where we want low jitter and
>>> stable latency at P99.
>>>
>>> Following is an analysis and comparison of the available (and
>>> compatible) busy poll interfaces for a low latency use case with stable
>>> P99. Note that throughput and cpu efficiency are non-goals.
>>>
>>> For the analysis we use an AF_XDP based benchmarking tool, `xdp_rr`.
>>> The tool simulates the real workload as follows:
>>>
>>> - It sends UDP packets between 2 machines.
>>> - The client machine sends packets at a fixed frequency. To maintain
>>>     this frequency, we use open-loop sampling, i.e. the packets are sent
>>>     from a separate thread.
>>> - The server replies to each packet inline, reading it from the recv
>>>     ring and replying via the tx ring.
>>> - To simulate the application processing time, we use a configurable
>>>     delay in usecs on the client side after a reply is received from the
>>>     server.
>>>
>>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftests.
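>>>
>>> As a rough illustration of the open-loop sender (a plain UDP socket
>>> stands in for the AF_XDP tx ring; address, port and rate are
>>> placeholders), the send loop, which the real tool runs in its own
>>> thread, follows an absolute-time schedule so slow replies cannot
>>> throttle the offered load:
>>>
>>>     /* open_loop_tx.c - send one packet every PERIOD_NS, come what may */
>>>     #define _POSIX_C_SOURCE 200112L
>>>     #include <arpa/inet.h>
>>>     #include <netinet/in.h>
>>>     #include <sys/socket.h>
>>>     #include <time.h>
>>>
>>>     #define PERIOD_NS (1000000000L / 12000)    /* ~12 Kpkt/s */
>>>
>>>     int main(void)
>>>     {
>>>         int fd = socket(AF_INET, SOCK_DGRAM, 0);
>>>         struct sockaddr_in dst = {
>>>             .sin_family = AF_INET,
>>>             .sin_port = htons(9999),
>>>         };
>>>         char payload[64] = { 0 };
>>>         struct timespec next;
>>>
>>>         inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);
>>>         clock_gettime(CLOCK_MONOTONIC, &next);
>>>
>>>         for (;;) {
>>>             sendto(fd, payload, sizeof(payload), 0,
>>>                    (struct sockaddr *)&dst, sizeof(dst));
>>>
>>>             /* advance from the previous deadline, not from "now",
>>>              * so the offered rate stays fixed */
>>>             next.tv_nsec += PERIOD_NS;
>>>             if (next.tv_nsec >= 1000000000L) {
>>>                 next.tv_nsec -= 1000000000L;
>>>                 next.tv_sec++;
>>>             }
>>>             clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
>>>         }
>>>     }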
>>>
>>> We use this tool with the following napi polling configurations:
>>>
>>> - Interrupts only
>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>     packet).
>>> - SO_BUSYPOLL (separate thread and separate core)
> This one uses a separate thread and core for polling the napi.

That's not what I am referring to below.

[snip]

>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
>>> |---|---|---|---|---|
>>> | 12 Kpkt/s + 0us delay | | | | |
>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>> | 32 Kpkt/s + 30us delay | | | | |
>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>> | 125 Kpkt/s + 6us delay | | | | |
>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>> | 12 Kpkt/s + 78us delay | | | | |
>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>> | 25 Kpkt/s + 38us delay | | | | |
>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>
>>>    ## Observations
>>
>> Hi Samiullah,
>>
> Thanks for the review
>> I believe you are comparing apples and oranges with these experiments.
>> Because threaded busy poll uses two cores at each end (at 100%), you
> The SO_BUSYPOLL(separate) column is running in a separate thread and
> using two cores, so this is actually comparing apples to apples.

I am not referring to SO_BUSYPOLL, but to the column labelled 
"interrupts". This is single-core, yes?

>> should compare with 2 pairs of xsk_rr processes using interrupt mode,
>> but each running at half the rate. I am quite certain you would then see
>> the same latency as in the baseline experiment - at much reduced cpu
>> utilization.
>>
>> Threaded busy poll reduces p99 latency by just 100 nsec, while
> The table in the experiments shows much larger differences in latency.

Yes, because all but the first experiment add processing delay to 
simulate 100% load and thus most likely show queuing effects.

Since "interrupts" uses just one core and "NAPI threaded" uses two, a 
fair comparison would be for "interrupts" to run two pairs of xsk_rr at 
half the rate each. Then the load would be well below 100%, no queueing, 
and latency would probably go back to the values measured in the "0us 
delay" experiments. At least that's what I would expect.

Reproduction is getting a bit difficult, because you haven't updated 
the xsk_rr RFC and, judging from the compilation error, may not have 
built or run these experiments for a while. It would be nice to have a 
working, reproducible setup.

>> busy-spinning two cores at each end - no more, no less. I continue to
>> believe that this trade-off and these limited benefits need to be
>> clearly and explicitly spelled out in the cover letter.
> Yes, if you just look at the first row of the table then there is
> virtually no difference.

I'm not sure what you mean by this. I compare "interrupts" with "NAPI 
threaded" for the case "12 Kpkt/s + 0us delay" and I have explained why 
I believe the other experiments are not meaningful.

Thanks,
Martin

