Message-ID: <a8a7ed7f-af44-4f15-9e30-651a2b9b86ba@uwaterloo.ca>
Date: Wed, 30 Apr 2025 15:57:21 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, willemb@...gle.com,
 jdamato@...tly.com, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll

On 2025-04-30 12:58, Samiullah Khawaja wrote:
> On Wed, Apr 30, 2025 at 8:23 AM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>
>> On 2025-04-28 09:50, Martin Karsten wrote:
>>> On 2025-04-24 16:02, Samiullah Khawaja wrote:
>>
>> [snip]
>>
>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
>>>> |---|---|---|---|---|
>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
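
For reference, a minimal sketch of how the in-thread and threaded
variants above are typically configured; the SO_BUSY_POLL budget, the
interface name, and the sysfs value are assumptions, and the busy-poll
mode added by this series may use a different sysfs value than plain
threaded NAPI:

#include <stdio.h>
#include <sys/socket.h>

/* In-thread busy polling: recv() on this socket spins in the kernel
 * for up to 'usecs' before sleeping. */
static int enable_busy_poll(int sock_fd, int usecs)
{
	return setsockopt(sock_fd, SOL_SOCKET, SO_BUSY_POLL,
			  &usecs, sizeof(usecs));
}

/* Threaded NAPI: move NAPI processing out of softirq context into a
 * per-NAPI kernel thread that can be pinned to a dedicated core. */
static int enable_threaded_napi(const char *ifname)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/net/%s/threaded", ifname);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs("1", f);
	return fclose(f);
}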
>>>>
>>>>    ## Observations
>>>>
>>>> - Here, without application processing, all the approaches give the
>>>>     same latency within a 1 usec range, and NAPI threaded gives the
>>>>     minimum latency.
>>>> - With application processing, the latency increases by 3-4 usecs
>>>>     when doing inline polling.
>>>> - Using a dedicated core to drive napi polling keeps the latency the
>>>>     same even with application processing. This is observed both in
>>>>     userspace and in threaded napi (in kernel).
>>>> - Using napi threaded polling in the kernel gives lower latency by
>>>>     1-1.5 usecs compared to userspace-driven polling on a separate
>>>>     core.
>>>> - With application processing, userspace will get a packet from the
>>>>     recv ring, spend some time doing application processing, and then
>>>>     do napi polling. While application processing is happening, a
>>>>     dedicated core doing napi polling can pull packets off the NAPI RX
>>>>     queue and populate the AF_XDP recv ring (sketched after this
>>>>     list). This means that when the application thread is done with
>>>>     application processing, it has new packets ready to recv and
>>>>     process in the recv ring.
>>>> - Napi threaded busy polling in the kernel with a dedicated core
>>>>     gives consistent P5-P99 latency.
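
As an illustration of the overlap described in the second-to-last point,
here is a minimal sketch of the application-side consumer loop, using
the libxdp xsk ring helpers from <xdp/xsk.h> (older trees: <bpf/xsk.h>);
the ring name, the batch size, and process_packet() are assumptions for
illustration, not taken from xsk_rr:

#include <stdint.h>
#include <xdp/xsk.h>

static void process_packet(const struct xdp_desc *desc); /* app work */

static void app_loop(struct xsk_ring_cons *rx_ring)
{
	uint32_t idx, i, n;

	for (;;) {
		/* Batch of descriptors the dedicated NAPI poller has
		 * already placed in the AF_XDP recv ring. */
		n = xsk_ring_cons__peek(rx_ring, 64, &idx);
		for (i = 0; i < n; i++)
			process_packet(xsk_ring_cons__rx_desc(rx_ring,
							      idx + i));
		if (n)
			xsk_ring_cons__release(rx_ring, n);
		/* While process_packet() runs, the poller keeps refilling
		 * the ring, so the next peek usually succeeds at once. */
	}
}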
>>> I've experimented with this some more. I can confirm latency savings of
>>> about 1 usec arising from busy-looping a NAPI thread on a dedicated core
>>> when compared to in-thread busy-polling. A few more comments:
> Thanks for the experiments and reproducing this. I really appreciate it.
>>>
>>> 1) I note that the experiment results above show that 'interrupts' is
>>> almost as fast as 'NAPI threaded' in the base case. I cannot confirm
>>> these results, because I currently only have (very) old hardware
>>> available for testing. However, these results worry me in terms of
>>> the necessity of the threaded busy-polling mechanism - also see Item
>>> 4) below.
>>
>> I want to add one more thought, just to spell this out explicitly:
>> Assuming the latency benefits result from better cache utilization of
>> two shorter processing loops (NAPI and application) using a dedicated
>> core each, it would make sense to see softirq processing on the NAPI
>> core being almost as fast. While there might be a small penalty for the
>> initial hardware interrupt, the following softirq processing does not
> The interrupt experiment in the last row demonstrates the penalty you
> mentioned. While this effect might be acceptable for some use cases,
> it could be problematic in scenarios sensitive to jitter (P99
> latency).

Just to be clear and explicit: The difference is 200 nsecs for P99 (13200 
vs 13000), i.e., 100 nsecs per core burned on either side. As I mentioned 
before, I don't think the 100%-load experiments (those with nonzero 
delay setting) are representative of any real-world scenario.

Thanks,
Martin

>> differ much from what a NAPI spin-loop does. The experiments seem to
>> corroborate this, because latency results for 'interrupts' and 'NAPI
>> threaded' are extremely close.
>>
>> In this case, it would be essential that interrupt handling happens on a
>> dedicated empty core, so it can react to hardware interrupts right away
>> and its local cache isn't dirtied by other code than softirq processing.
>> While this also means dedicating an entire core to NAPI processing, at
>> least the core wouldn't have to spin all the time, hopefully reducing
>> power consumption and heat generation.
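
For completeness, a sketch of what pinning the RX queue's interrupt to
such a dedicated core could look like; the IRQ number used in the
comment below is an assumption, the real one comes from
/proc/interrupts:

#include <stdio.h>

static int pin_irq_to_cpu(int irq, int cpu)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	/* smp_affinity takes a hex CPU bitmask. */
	fprintf(f, "%x", 1u << cpu);
	return fclose(f);
}

/* e.g. pin_irq_to_cpu(142, 3): core 3 then runs only hardirq/softirq
 * processing for that queue, so it reacts to the hardware interrupt
 * right away and its L1 cache stays clean. */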
>>
>> Thanks,
>> Martin
>>> 2) The experiments reported here are symmetric in that they use the same
>>> polling variant at both the client and the server. When mixing things up
>>> by combining different polling variants, it becomes clear that the
>>> latency savings are split between both ends. The total savings of 1 usec
>>> are thus a combination of 0.5 usec at either end. So the ultimate
>>> trade-off is 0.5 usec latency gain for burning 1 core.
>>>
>>> 3) I believe the savings arise from running two tight loops (separate
>>> NAPI and application) instead of one longer loop. The shorter loops
>>> likely result in better cache utilization on their respective dedicated
>>> cores (and L1 caches). However, I am not sure right now how to
>>> explicitly confirm this.
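
One way this could perhaps be confirmed is with per-thread hardware
counters; a minimal sketch using perf_event_open(2) to count L1D read
misses around the loop under test (the event choice and measurement
placement are assumptions):

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_l1d_miss_counter(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HW_CACHE;
	attr.config = PERF_COUNT_HW_CACHE_L1D |
		      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
		      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
	attr.disabled = 1;
	/* Count this thread only, on whatever CPU it runs. */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	int fd = open_l1d_miss_counter();
	long long misses;

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	/* ... run the NAPI or application loop under test here ... */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	read(fd, &misses, sizeof(misses));
	printf("L1D read misses: %lld\n", misses);
	return 0;
}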
>>>
>>> 4) I still believe that the additional experiments with setting both
>>> delay and period are meaningless. They create corner cases where rate *
>>> delay is about 1. Nobody would run a latency-critical system at 100%
>>> load. I also note that the experiment program xsk_rr fails when trying
>>> to increase the load beyond saturation (client fails with 'xsk_rr:
>>> oustanding array full').
>>>
>>> 5) I worry that a mechanism like this might be misinterpreted as some
>>> kind of magic wand for improving performance and might end up being used
>>> in practice and cause substantial overhead without much gain. If
>>> accepted, I would hope that this will be documented very clearly and
>>> have appropriate warnings attached. Given that the patch cover letter is
>>> often used as a basis for documentation, I believe this should be
>>> spelled out in the cover letter.
>>>
>>> With the above in mind, someone else will need to judge whether (at
>>> most) 0.5 usec for burning a core is a worthy enough trade-off to
>>> justify inclusion of this mechanism. Maybe someone else can take a
>>> closer look at the 'interrupts' variant on modern hardware.
>>>
>>> Thanks,
>>> Martin
>>

