Message-ID: <CAAywjhQfiWWqE-tpwrVGR9a3uVLbVrSTq7_n_dJGE7c27io7MQ@mail.gmail.com>
Date: Thu, 28 Aug 2025 15:23:43 -0700
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Martin Karsten <mkarsten@...terloo.ca>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com,
willemb@...gle.com, Joe Damato <joe@...a.to>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll
On Mon, Aug 25, 2025 at 12:45 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>
> On 2025-08-25 14:53, Samiullah Khawaja wrote:
> > On Mon, Aug 25, 2025 at 10:41 AM Martin Karsten <mkarsten@...terloo.ca> wrote:
> >>
> >> On 2025-08-25 13:20, Samiullah Khawaja wrote:
> >>> On Sun, Aug 24, 2025 at 5:03 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
> >>>>
> >>>> On 2025-08-24 17:54, Samiullah Khawaja wrote:
> >>>>> Extend the existing threaded napi poll support to do continuous
> >>>>> busy polling.
> >>>>>
> >>>>> This is used to continuously poll a napi and fetch descriptors from
> >>>>> the backing RX/TX queues for low-latency applications. Allow enabling
> >>>>> threaded busy poll through netlink so it can be turned on for a set
> >>>>> of dedicated napis.
> >>>>>
> >>>>> Once enabled, the user can fetch the PID of the kthread doing the
> >>>>> NAPI polling and set its affinity, priority and scheduler according
> >>>>> to the low-latency requirements.
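
For reference, here is a minimal userspace sketch of that tuning, assuming
the kthread PID has already been obtained (e.g. via the napi-get netlink op
or from the kthread name); the PID, CPU and priority values are placeholders:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Sketch: pin a napi polling kthread to a CPU and make it SCHED_FIFO.
 * The PID (12345), CPU (3) and priority (50) are placeholders. */
int main(void)
{
	pid_t napi_pid = 12345;
	struct sched_param sp = { .sched_priority = 50 };
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(3, &set);		/* dedicate CPU 3 to the poller */
	if (sched_setaffinity(napi_pid, sizeof(set), &set))
		perror("sched_setaffinity");
	if (sched_setscheduler(napi_pid, SCHED_FIFO, &sp))
		perror("sched_setscheduler");
	return 0;
}
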
> >>>>>
> >>>>> Currently threaded napi can only be enabled at the device level using
> >>>>> sysfs. Add support to enable/disable threaded mode for an individual
> >>>>> napi through the netlink interface, by extending the `napi-set` op in
> >>>>> the netlink spec to allow setting the `threaded` attribute of a napi.
> >>>>>
> >>>>> Extend the threaded attribute in the napi struct with an option to
> >>>>> enable continuous busy polling. Extend the netlink and sysfs
> >>>>> interfaces to allow enabling/disabling threaded busy polling at the
> >>>>> device or individual napi level.
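
To illustrate the device-level sysfs path, a minimal sketch is below. The
existing threaded attribute accepts 0/1 today; the value "2" used here for
the busy-poll mode is only an assumption about how this series extends it,
so the patch itself is authoritative for the accepted values. The per-napi
path goes through the extended `napi-set` netlink op instead.

/* Sketch: enable threaded napi busy polling for a device via sysfs.
 * /sys/class/net/<dev>/threaded takes 0/1 today; "2" for busy-poll
 * mode is an assumption about this series. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/class/net/eth0/threaded", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fputs("2", f);	/* assumed busy-poll mode; 1 = threaded, 0 = off */
	return fclose(f) ? 1 : 0;
}
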
> >>>>>
> >>>>> We use this for our AF_XDP based hard low-latency use case with
> >>>>> microsecond-level latency requirements. For this use case we want
> >>>>> low jitter and stable latency at P99.
> >>>>>
> >>>>> Following is an analysis and comparison of the available (and
> >>>>> compatible) busy poll interfaces for a low-latency use case with
> >>>>> stable P99. Please note that throughput and cpu efficiency are
> >>>>> non-goals.
> >>>>>
> >>>>> For the analysis we use an AF_XDP based benchmarking tool, `xdp_rr`.
> >>>>> The tool and how it tries to simulate a real workload are described
> >>>>> below:
> >>>>>
> >>>>> - It sends UDP packets between 2 machines.
> >>>>> - The client machine sends packets at a fixed frequency. To maintain
> >>>>> the packet send frequency, we use open-loop sampling; that is, the
> >>>>> packets are sent from a separate thread (see the sketch after this
> >>>>> list).
> >>>>> - The server replies to each packet inline: it reads the packet from
> >>>>> the RX ring and sends the reply through the TX ring.
> >>>>> - To simulate the application processing time, we use a configurable
> >>>>> delay in usecs on the client side after a reply is received from the
> >>>>> server.
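
For clarity, below is a rough sketch (not the actual xdp_rr code) of how
such an open-loop sender can be structured so that send times do not depend
on reply latency; send_one_packet() is a placeholder:

#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Placeholder for pushing one UDP packet into the TX ring. */
static void send_one_packet(void) { }

/* Open-loop sender: packets go out on a fixed schedule from a dedicated
 * thread, regardless of when (or whether) replies come back. */
static void open_loop_sender(long period_ns)
{
	struct timespec next;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (;;) {
		send_one_packet();
		/* Advance an absolute deadline instead of sleeping relative
		 * to "now", so processing time does not skew the rate. */
		next.tv_nsec += period_ns;
		while (next.tv_nsec >= NSEC_PER_SEC) {
			next.tv_nsec -= NSEC_PER_SEC;
			next.tv_sec++;
		}
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}
}

int main(void)
{
	open_loop_sender(NSEC_PER_SEC / 12000);	/* ~12 Kpkt/s */
	return 0;
}
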
> >>>>>
> >>>>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftests.
> >>>>>
> >>>>> We use this tool with the following napi polling configurations:
> >>>>>
> >>>>> - Interrupts only
> >>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> >>>>> packet).
> >>>>> - SO_BUSYPOLL (separate thread and separate core)
> >>> This one uses separate thread and core for polling the napi.
> >>
> >> That's not what I am referring to below.
> >>
> >> [snip]
> >>
> >>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> >>>>> |---|---|---|---|---|
> >>>>> | 12 Kpkt/s + 0us delay | | | | |
> >>>>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >>>>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >>>>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >>>>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >>>>> | 32 Kpkt/s + 30us delay | | | | |
> >>>>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >>>>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >>>>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >>>>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >>>>> | 125 Kpkt/s + 6us delay | | | | |
> >>>>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >>>>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >>>>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >>>>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >>>>> | 12 Kpkt/s + 78us delay | | | | |
> >>>>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >>>>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >>>>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >>>>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >>>>> | 25 Kpkt/s + 38us delay | | | | |
> >>>>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >>>>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >>>>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >>>>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>>>>
> >>>>> ## Observations
> >>>>
> >>>> Hi Samiullah,
> >>>>
> >>> Thanks for the review
> >>>> I believe you are comparing apples and oranges with these experiments.
> >>>> Because threaded busy poll uses two cores at each end (at 100%), you
> >>> The SO_BUSYPOLL(separate) column is actually running in a separate
> >>> thread and using two cores. So this is actually comparing apples to
> >>> apples.
> >>
> >> I am not referring to SO_BUSYPOLL, but to the column labelled
> >> "interrupts". This is single-core, yes?
Not really. The interrupts are pinned to a different CPU and the
process (and its threads) runs on a different CPU. Please check the
cover letter for the interrupt and process affinity configurations.
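
Concretely, that split can be reproduced by steering the NIC queue's IRQ to
one CPU and pinning the benchmark process to another; a rough sketch, with
the IRQ number and CPU ids as placeholders:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Sketch: steer the RX queue interrupt to CPU 2 and pin the benchmark
 * process to CPU 4. The IRQ number (120) and CPU ids are placeholders. */
int main(void)
{
	FILE *f = fopen("/proc/irq/120/smp_affinity_list", "w");
	cpu_set_t set;

	if (f) {
		fputs("2", f);		/* NIC queue interrupt on CPU 2 */
		fclose(f);
	}

	CPU_ZERO(&set);
	CPU_SET(4, &set);		/* xsk_rr process on CPU 4 */
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
	return 0;
}
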
> >>
> >>>> should compare with 2 pairs of xsk_rr processes using interrupt mode,
> >>>> but each running at half the rate. I am quite certain you would then see
> >>>> the same latency as in the baseline experiment - at much reduced cpu
> >>>> utilization.
> >>>>
> >>>> Threaded busy poll reduces p99 latency by just 100 nsec, while
> >>> The table in the experiments shows much larger differences in latency.
> >>
> >> Yes, because all but the first experiment add processing delay to
> >> simulate 100% load and thus most likely show queuing effects.
> >>
> >> Since "interrupts" uses just one core and "NAPI threaded" uses two, a
> >> fair comparison would be for "interrupts" to run two pairs of xsk_rr at
> >> half the rate each. Then the load would be well below 100%, no queueing,
> >> and latency would probably go back to the values measured in the "0us
> >> delay" experiments. At least that's what I would expect.
> > Two sets of xsk_rr will go to two different NIC queues with two
> > different interrupts (I think). That would be comparing apples to
> > oranges, as all the other columns use a single NIC queue. Having
> > (forcing the user to have) two xsk sockets to deliver packets at a
> > certain rate is a completely different use case.
>
> I don't think a NIC queue is a more critical resource than a CPU core?
>
> And the rest depends on the actual application that would be using the
> service. The restriction to xsk_rr and its particulars is because that's
> the benchmark you provided.
> >> Reproduction is getting a bit difficult, because you haven't updated the
> >> xsk_rr RFC and judging from the compilation error, maybe not built/run
> >> these experiments for a while now? It would be nice to have a working
> >> reproducible setup.
> > Oh. Let me check the xsk_rr and see whether it is outdated. I will
> > send out another RFC for it if it's outdated.
I checked this; it seems the last xsk_rr posting needs to be rebased. I
will send it out shortly.
> >>
> >>>> busy-spinning two cores, at each end - not more not less. I continue to
> >>>> believe that this trade-off and these limited benefits need to be
> >>>> clearly and explicitly spelled out in the cover letter.
> >>> Yes, if you just look at the first row of the table then there is
> >>> virtually no difference.
> >> I'm not sure what you mean by this. I compare "interrupts" with "NAPI
> >> threaded" for the case "12 Kpkt/s + 0us delay" and I have explained why
> >> I believe the other experiments are not meaningful.
> > Yes, that is exactly what I am disagreeing with. I don't think the
> > other rows are "not meaningful". The xsk_rr is trying to "simulate the
> > application processing" by adding a cpu delay, and the table clearly
> > shows the comparison between the various mechanisms and how they
> > perform under load.
>
> But these experiments only look at cases with almost exactly 100% load.
> As I mentioned in a previous round, this is highly unlikely for a
> latency-critical service and thus it seems contrived. Once you go to
> 100% load and see queueing effects, you also need to look left and right
> to investigate other load and system settings.
>
> Maybe this means the xsk_rr tool is not a good enough benchmark?
>
> Thanks,
> Martin
>