Message-ID: <CAAywjhRM8wd67DwUttU76+6KrKUki-w9hgkbVskhVG+nJ4JNig@mail.gmail.com>
Date: Wed, 30 Apr 2025 09:58:25 -0700
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Martin Karsten <mkarsten@...terloo.ca>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, 
	willemb@...gle.com, jdamato@...tly.com, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll

On Wed, Apr 30, 2025 at 8:23 AM Martin Karsten <mkarsten@...terloo.ca> wrote:
>
> On 2025-04-28 09:50, Martin Karsten wrote:
> > On 2025-04-24 16:02, Samiullah Khawaja wrote:
>
> [snip]
>
> >> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> >> |---|---|---|---|---|
> >> | 12 Kpkt/s + 0us delay | | | | |
> >> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >> | 32 Kpkt/s + 30us delay | | | | |
> >> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >> | 125 Kpkt/s + 6us delay | | | | |
> >> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >> | 12 Kpkt/s + 78us delay | | | | |
> >> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >> | 25 Kpkt/s + 38us delay | | | | |
> >> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>
> >>   ## Observations
> >>
> >> - Here, without application processing, all the approaches give the
> >>    same latency within a 1 usec range, and NAPI threaded gives the
> >>    lowest latency.
> >> - With application processing, the latency increases by 3-4 usecs
> >>    when doing inline polling.
> >> - Using a dedicated core to drive napi polling keeps the latency the
> >>    same even with application processing. This is observed both with
> >>    userspace polling and with threaded napi (in kernel).
> >> - Using napi threaded polling in the kernel gives lower latency by
> >>    1-1.5 usecs compared to userspace-driven polling on a separate core.
> >> - With application processing, userspace gets a packet from the recv
> >>    ring, spends some time on application processing, and then does napi
> >>    polling. While application processing is happening, a dedicated core
> >>    doing napi polling can pull packets off the NAPI RX queue and
> >>    populate the AF_XDP recv ring. This means that when the application
> >>    thread is done with application processing, it has new packets ready
> >>    to recv and process in the recv ring.
> >> - Napi threaded busy polling in the kernel with a dedicated core gives
> >>    consistent P5-P99 latency.
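For reference, here is a minimal sketch of how an application could opt a
socket into the SO_BUSYPOLL modes above; the option values and the
enable_busy_poll() helper are illustrative and are not taken from xsk_rr:

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

/* Fallbacks for older userspace headers; the values match
 * include/uapi/asm-generic/socket.h on most architectures. */
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL	69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET	70
#endif

static int enable_busy_poll(int fd)
{
	int one = 1;
	int usecs = 20;		/* illustrative busy-poll timeout */
	int budget = 64;	/* illustrative per-poll packet budget */

	if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &one, sizeof(one)) ||
	    setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) ||
	    setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget, sizeof(budget))) {
		perror("setsockopt");
		return -errno;
	}

	/* The application then drives NAPI from its own context by calling
	 * recvfrom()/sendto() on the socket, e.g.
	 *	recvfrom(fd, NULL, 0, MSG_DONTWAIT, NULL, NULL);
	 * either inline with the processing loop (SO_BUSYPOLL) or from a
	 * dedicated polling thread (SO_BUSYPOLL(separate)). */
	return 0;
}

In the NAPI threaded column that polling instead happens in a kernel napi
thread pinned to its own core, so the application only consumes from the
recv ring.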
> > I've experimented with this some more. I can confirm latency savings of
> > about 1 usec arising from busy-looping a NAPI thread on a dedicated core
> > when compared to in-thread busy-polling. A few more comments:
Thanks for the experiments and for reproducing this. I really appreciate it.
> >
> > 1) I note that the experiment results above show that 'interrupts' is
> > almost as fast as 'NAPI threaded' in the base case. I cannot confirm
> > these results, because I currently only have (very) old hardware
> > available for testing. However, these results worry me in terms of the
> > necessity of the threaded busy-polling mechanism - also see Item 4) below.
>
> I want to add one more thought, just to spell this out explicitly:
> Assuming the latency benefits result from better cache utilization of
> two shorter processing loops (NAPI and application) using a dedicated
> core each, it would make sense to see softirq processing on the NAPI
> core being almost as fast. While there might be a small penalty for the
> initial hardware interrupt, the following softirq processing should not
The interrupt experiment in the last row demonstrates the penalty you
mentioned. While this effect might be acceptable for some use cases,
it could be problematic in scenarios sensitive to jitter (P99
latency).
> differ much from what a NAPI spin-loop does. The experiments seem to
> corroborate this, because latency results for 'interrupts' and 'NAPI
> threaded' are extremely close.
>
> In this case, it would be essential that interrupt handling happens on a
> dedicated empty core, so it can react to hardware interrupts right away
> and its local cache isn't dirtied by code other than softirq processing.
> While this also means dedicating an entire core to NAPI processing, at
> least the core wouldn't have to spin all the time, hopefully reducing
> power consumption and heat generation.
>
> Thanks,
> Martin
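For anyone wanting to try the dedicated-interrupt-core setup described
above, a rough sketch; the irq and cpu values are placeholders that would
come from /proc/interrupts for the RX queue under test:

#include <stdio.h>

/* Steer one IRQ to a single CPU via /proc/irq/<irq>/smp_affinity_list. */
static int pin_irq_to_cpu(int irq, int cpu)
{
	char path[64];
	FILE *f;
	int ret = 0;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%d\n", cpu) < 0)
		ret = -1;
	if (fclose(f))
		ret = -1;
	return ret;
}

Keeping that core otherwise idle (e.g. via cpusets or isolcpus) should give
the hardirq and the following softirq the quiet cache assumed above; the
irqbalance daemon may need to be stopped so the affinity setting sticks.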
> > 2) The experiments reported here are symmetric in that they use the same
> > polling variant at both the client and the server. When mixing things up
> > by combining different polling variants, it becomes clear that the
> > latency savings are split between both ends. The total savings of 1 usec
> > are thus a combination of 0.5 usec at either end. So the ultimate
> > trade-off is 0.5 usec latency gain for burning 1 core.
> >
> > 3) I believe the savings arise from running two tight loops (separate
> > NAPI and application) instead of one longer loop. The shorter loops
> > likely result in better cache utilization on their respective dedicated
> > cores (and L1 caches). However, I am not sure right now how to
> > explicitly confirm this.
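One possible way to probe the cache hypothesis in 3) is to compare L1D load
misses on the dedicated NAPI core versus a core running the combined loop
while the benchmark runs. A rough sketch using perf_event_open(); the cpu
number and sampling window are placeholders, and counting a whole CPU needs
CAP_PERFMON or a permissive perf_event_paranoid setting:

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Count L1D read misses for everything running on one CPU. */
static int open_l1d_miss_counter(int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HW_CACHE;
	attr.config = PERF_COUNT_HW_CACHE_L1D |
		      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
		      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);

	/* pid = -1, cpu = <cpu>: per-CPU counter, all tasks. */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

int main(void)
{
	int cpu = 2;	/* placeholder: core running the NAPI thread */
	int fd = open_l1d_miss_counter(cpu);
	uint64_t misses;

	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	sleep(10);	/* sample while the benchmark is running */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &misses, sizeof(misses)) == (ssize_t)sizeof(misses))
		printf("cpu%d L1D load misses: %llu\n", cpu,
		       (unsigned long long)misses);
	return 0;
}

If the split-loop setup really wins on cache locality, the per-packet miss
rate on the two dedicated cores should come out noticeably lower than on a
single core running the combined loop.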
> >
> > 4) I still believe that the additional experiments with setting both
> > delay and period are meaningless. They create corner cases where rate *
> > delay is about 1. Nobody would run a latency-critical system at 100%
> > load. I also note that the experiment program xsk_rr fails when trying
> > to increase the load beyond saturation (client fails with 'xsk_rr:
> > oustanding array full').
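For what it's worth, multiplying rate by delay for the loaded rows of the
table above (treating the delay as per-packet application processing time)
gives roughly:

   32 Kpkt/s * 30 us =  32000/s * 0.000030 s ~= 0.96
  125 Kpkt/s *  6 us = 125000/s * 0.000006 s  = 0.75
   12 Kpkt/s * 78 us =  12000/s * 0.000078 s ~= 0.94
   25 Kpkt/s * 38 us =  25000/s * 0.000038 s  = 0.95

so three of the four loaded configurations run the application stage at
well above 90% utilization, which seems to be the corner case 4) refers to.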
> >
> > 5) I worry that a mechanism like this might be misinterpreted as some
> > kind of magic wand for improving performance and might end up being used
> > in practice and cause substantial overhead without much gain. If
> > accepted, I would hope that this will be documented very clearly and
> > have appropriate warnings attached. Given that the patch cover letter is
> > often used as a basis for documentation, I believe this should be
> > spelled out in the cover letter.
> >
> > With the above in mind, someone else will need to judge whether (at
> > most) 0.5 usec for burning a core is a worthy enough trade-off to
> > justify inclusion of this mechanism. Maybe someone else can take a
> > closer look at the 'interrupts' variant on modern hardware.
> >
> > Thanks,
> > Martin
>
