Message-ID: <CAAywjhRnV9t_64DA1XLuDx89u2oMSEep0RCYO84YRKn5PxsUkA@mail.gmail.com>
Date: Wed, 30 Apr 2025 13:33:20 -0700
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Martin Karsten <mkarsten@...terloo.ca>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com,
willemb@...gle.com, jdamato@...tly.com, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
On Wed, Apr 30, 2025 at 12:57 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>
> On 2025-04-30 12:58, Samiullah Khawaja wrote:
> > On Wed, Apr 30, 2025 at 8:23 AM Martin Karsten <mkarsten@...terloo.ca> wrote:
> >>
> >> On 2025-04-28 09:50, Martin Karsten wrote:
> >>> On 2025-04-24 16:02, Samiullah Khawaja wrote:
> >>
> >> [snip]
> >>
> >>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> >>>> |---|---|---|---|---|
> >>>> | 12 Kpkt/s + 0us delay | | | | |
> >>>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >>>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >>>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >>>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >>>> | 32 Kpkt/s + 30us delay | | | | |
> >>>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >>>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >>>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >>>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >>>> | 125 Kpkt/s + 6us delay | | | | |
> >>>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >>>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >>>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >>>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >>>> | 12 Kpkt/s + 78us delay | | | | |
> >>>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >>>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >>>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >>>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >>>> | 25 Kpkt/s + 38us delay | | | | |
> >>>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >>>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >>>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >>>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>>>
> >>>> ## Observations
> >>>>
> >>>> - Here, without application processing, all the approaches give the
> >>>> same latency within a 1 usec range, with NAPI threaded giving the
> >>>> minimum latency.
> >>>> - With application processing, the latency increases by 3-4 usecs
> >>>> when doing inline polling.
> >>>> - Using a dedicated core to drive napi polling keeps the latency the
> >>>> same even with application processing. This is observed both with
> >>>> userspace polling and with threaded napi (in kernel).
> >>>> - Napi threaded polling in the kernel gives lower latency by
> >>>> 1-1.5 usecs compared to userspace-driven polling on a separate core.
> >>>> - With application processing, the userspace thread gets a packet
> >>>> from the recv ring, spends some time on application processing, and
> >>>> only then does napi polling. While application processing is
> >>>> happening, a dedicated core doing napi polling can pull packets off
> >>>> the NAPI RX queue and populate the AF_XDP recv ring. This means that
> >>>> when the application thread is done with its processing, it has new
> >>>> packets ready to recv and process in the recv ring (a sketch of this
> >>>> pattern follows below).
> >>>> - Napi threaded busy polling in the kernel with a dedicated core
> >>>> gives consistent P5-P99 latency.
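To make the inline-polling pattern above concrete, here is a minimal
sketch. It uses a plain UDP socket with the existing busy-poll socket
options rather than the AF_XDP setup xsk_rr actually uses; the port
number and process_packet() are illustrative stand-ins:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70
#endif

/* stand-in for the "application processing" delay in the benchmark */
static void process_packet(const char *buf, ssize_t len)
{
	(void)buf;
	(void)len;
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int usecs = 64, prefer = 1, budget = 8;
	char buf[2048];

	/* busy-poll the socket's NAPI context for up to 64us per read */
	setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
	/* prefer busy polling over interrupt-driven processing (5.11+) */
	setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &prefer, sizeof(prefer));
	setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget, sizeof(budget));

	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(9000),		/* illustrative port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));

	for (;;) {
		/* while process_packet() runs, this thread drives nothing;
		 * a dedicated poller (threaded napi or a separate userspace
		 * core) keeps refilling the rx path during that window */
		ssize_t len = recv(fd, buf, sizeof(buf), 0);
		if (len > 0)
			process_packet(buf, len);
	}
}

The comment in the receive loop is the crux of the argument above:
whether anything drives the NAPI context while the application is busy.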
> >>> I've experimented with this some more. I can confirm latency savings of
> >>> about 1 usec arising from busy-looping a NAPI thread on a dedicated core
> >>> when compared to in-thread busy-polling. A few more comments:
> > Thanks for the experiments and reproducing this. I really appreciate it.
> >>>
> >>> 1) I note that the experiment results above show that 'interrupts' is
> >>> almost as fast as 'NAPI threaded' in the base case. I cannot confirm
> >>> these results, because I currently only have (very) old hardware
> >>> available for testing. However, these results worry me in terms of the
> >>> necessity of the threaded busy-polling mechanism - also see Item 4) below.
> >>
> >> I want to add one more thought, just to spell this out explicitly:
> >> Assuming the latency benefits result from better cache utilization of
> >> two shorter processing loops (NAPI and application) using a dedicated
> >> core each, it would make sense to see softirq processing on the NAPI
> >> core being almost as fast. While there might be a small penalty for the
> >> initial hardware interrupt, the following softirq processing does not
> > The interrupt experiment in the last row demonstrates the penalty you
> > mentioned. While this effect might be acceptable for some use cases,
> > it could be problematic in scenarios sensitive to jitter (P99
> > latency).
>
> Just to be clear and explicit: The difference is 200 nsecs for P99
> (13200 vs 13000), i.e., 100 nsecs per core burned on either side. As I
> mentioned
> before, I don't think the 100%-load experiments (those with nonzero
> delay setting) are representative of any real-world scenario.
Oh, you are only considering the first row. Yes, with zero delay it
would (mostly) be equal. I agree with you that there is very little
difference in that particular scenario.
>
> Thanks,
> Martin
>
> >> differ much from what a NAPI spin-loop does? The experiments seem to
> >> corroborate this, because latency results for 'interrupts' and 'NAPI
> >> threaded' are extremely close.
> >>
> >> In this case, it would be essential that interrupt handling happens on a
> >> dedicated empty core, so it can react to hardware interrupts right away
> >> and its local cache isn't dirtied by code other than softirq processing.
> >> While this also means dedicating an entire core to NAPI processing, at
> >> least the core wouldn't have to spin all the time, hopefully reducing
> >> power consumption and heat generation.
> >>
> >> Thanks,
> >> Martin
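As a concrete reference for the two setups being contrasted here
(interrupt handling pinned to an otherwise-empty core vs. threaded
napi), both are reachable through existing kernel interfaces. A rough
sketch, where eth0 and IRQ 128 are illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, val, strlen(val));
	close(fd);
	return n < 0 ? -1 : 0;
}

int main(void)
{
	/* steer the rx interrupt (and its softirq processing) to core 2 */
	if (write_str("/proc/irq/128/smp_affinity_list", "2"))
		perror("irq affinity");

	/* alternatively: run eth0's NAPI contexts in kthreads, which can
	 * then be pinned to a dedicated core */
	if (write_str("/sys/class/net/eth0/threaded", "1"))
		perror("napi threaded");
	return 0;
}

The kthreads created by the second knob show up as napi/eth0-<napi-id>
and can be pinned with taskset or sched_setaffinity().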
> >>> 2) The experiments reported here are symmetric in that they use the same
> >>> polling variant at both the client and the server. When mixing things up
> >>> by combining different polling variants, it becomes clear that the
> >>> latency savings are split between both ends. The total savings of 1 usec
> >>> are thus a combination of 0.5 usec at either end. So the ultimate
> >>> trade-off is 0.5 usec latency gain for burning 1 core.
> >>>
> >>> 3) I believe the savings arise from running two tight loops (separate
> >>> NAPI and application) instead of one longer loop. The shorter loops
> >>> likely result in better cache utilization on their respective dedicated
> >>> cores (and L1 caches). However, I am not sure right now how to explicitly
> >>> confirm this.
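One way this might be checked is with per-core hardware counters, e.g.
counting L1D load misses on the NAPI core and on the application core
while the benchmark runs. A rough perf_event_open() sketch; the core
numbers are illustrative and it needs root (or a relaxed
perf_event_paranoid):

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_l1d_miss_counter(int cpu)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_HW_CACHE,
		.size = sizeof(struct perf_event_attr),
		.config = PERF_COUNT_HW_CACHE_L1D |
			  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
			  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16),
	};

	/* pid == -1 && cpu >= 0: count every task on that cpu */
	return syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
}

int main(void)
{
	int napi_cpu = 2, app_cpu = 4;	/* illustrative core numbers */
	int fds[2] = { open_l1d_miss_counter(napi_cpu),
		       open_l1d_miss_counter(app_cpu) };

	sleep(10);	/* sample while the benchmark is running */

	for (int i = 0; i < 2; i++) {
		uint64_t count = 0;

		if (read(fds[i], &count, sizeof(count)) == sizeof(count))
			printf("cpu %d: %llu L1D load misses\n",
			       i ? app_cpu : napi_cpu,
			       (unsigned long long)count);
		close(fds[i]);
	}
	return 0;
}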
> >>>
> >>> 4) I still believe that the additional experiments with setting both
> >>> delay and period are meaningless. They create corner cases where rate *
> >>> delay is about 1 (e.g., 32 Kpkt/s * 30 us = 0.96, and similarly for the
> >>> other rows), i.e., the application core is busy nearly 100% of the
> >>> time. Nobody would run a latency-critical system at 100% load. I also
> >>> note that the experiment program xsk_rr fails when trying to increase
> >>> the load beyond saturation (client fails with 'xsk_rr: oustanding
> >>> array full').
> >>>
> >>> 5) I worry that a mechanism like this might be misinterpreted as some
> >>> kind of magic wand for improving performance and might end up being used
> >>> in practice and cause substantial overhead without much gain. If
> >>> accepted, I would hope that this will be documented very clearly and
> >>> have appropriate warnings attached. Given that the patch cover letter is
> >>> often used as a basis for documentation, I believe this should be
> >>> spelled out in the cover letter.
> >>>
> >>> With the above in mind, someone else will need to judge whether (at
> >>> most) 0.5 usec for burning a core is a worthy enough trade-off to
> >>> justify inclusion of this mechanism. Maybe someone else can take a
> >>> closer look at the 'interrupts' variant on modern hardware.
> >>>
> >>> Thanks,
> >>> Martin
> >>
>