Message-ID: <6441ae17-1b7d-4a8d-96d0-f526f410346c@uwaterloo.ca>
Date: Mon, 25 Aug 2025 16:21:53 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, willemb@...gle.com,
 Joe Damato <joe@...a.to>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll

[snip]

>>>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
>>>>>> |---|---|---|---|---|
>>>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>>>>
>>>>>>     ## Observations
>>>>>
>>>>> Hi Samiullah,
>>>>>
>>>> Thanks for the review
>>>>> I believe you are comparing apples and oranges with these experiments.
>>>>> Because threaded busy poll uses two cores at each end (at 100%), you
>>>> The SO_BUSYPOLL(separate) column is actually running in a separate
>>>> thread and using two cores. So this is actually comparing apples to
>>>> apples.
>>>
>>> I am not referring to SO_BUSYPOLL, but to the column labelled
>>> "interrupts". This is single-core, yes?
>>>
>>>>> should compare with 2 pairs of xsk_rr processes using interrupt mode,
>>>>> but each running at half the rate. I am quite certain you would 
>>>>> then see
>>>>> the same latency as in the baseline experiment - at much reduced cpu
>>>>> utilization.
>>>>>
>>>>> Threaded busy poll reduces p99 latency by just 100 nsec, while
>>>> The table in the experiments shows much larger differences in latency.
>>>
>>> Yes, because all but the first experiment add processing delay to
>>> simulate 100% load and thus most likely show queuing effects.
>>>
>>> Since "interrupts" uses just one core and "NAPI threaded" uses two, a
>>> fair comparison would be for "interrupts" to run two pairs of xsk_rr at
>>> half the rate each. Then the load would be well below 100%, no queueing,
>>> and latency would probably go back to the values measured in the "0us
>>> delay" experiments. At least that's what I would expect.
>> Two sets of xsk_rr will go to two different NIC queues with two
>> different interrupts (I think). That would be comparing apples to
>> oranges, as all the other columns use a single NIC queue. Having
>> (or forcing the user to have) two xsk sockets to deliver packets at a
>> certain rate is a completely different use case.
> 
> I don't think a NIC queue is a more critical resource than a CPU core?
> 
> And the rest depends on the actual application that would be using the 
> service. The restriction to xsk_rr and its particulars is because that's 
> the benchmark you provided.
>>> Reproduction is getting a bit difficult, because you haven't updated the
>>> xsk_rr RFC and, judging from the compilation error, maybe haven't
>>> built/run these experiments for a while now? It would be nice to have a
>>> working, reproducible setup.
>> Oh. Let me check the xsk_rr and see whether it is outdated. If it is,
>> I will send out another RFC for it.
>>>
>>>>> busy-spinning two cores, at each end - not more, not less. I continue
>>>>> to believe that this trade-off and these limited benefits need to be
>>>>> clearly and explicitly spelled out in the cover letter.
>>>> Yes, if you just look at the first row of the table then there is
>>>> virtually no difference.
>>> I'm not sure what you mean by this. I compare "interrupts" with "NAPI
>>> threaded" for the case "12 Kpkt/s + 0us delay" and I have explained why
>>> I believe the other experiments are not meaningful.
>> Yes, that is exactly what I am disagreeing with. I don't think the other
>> rows are "not meaningful". xsk_rr is trying to "simulate the application
>> processing" by adding a CPU delay, and the table clearly shows how the
>> various mechanisms compare and how they perform under load.
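
(Side note, mostly for anyone trying to reproduce this: I read the
"+ Nus delay" configurations as the benchmark burning roughly N
microseconds of CPU per request to mimic application processing, along
the lines of the sketch below. The actual xsk_rr implementation may
well differ in detail.)

#include <time.h>

/* Busy-wait for roughly 'usecs' microseconds, as a stand-in for
 * per-request application processing. Sketch only, not from xsk_rr. */
static void simulate_app_processing(unsigned int usecs)
{
	struct timespec start, now;
	long long elapsed_ns;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
		elapsed_ns = (long long)(now.tv_sec - start.tv_sec) * 1000000000LL
			     + (now.tv_nsec - start.tv_nsec);
	} while (elapsed_ns < (long long)usecs * 1000LL);
}
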
> 
> But these experiments only look at cases with almost exactly 100% load. 
> As I mentioned in a previous round, this is highly unlikely for a 
> latency-critical service and thus it seems contrived. Once you go to 
> 100% load and see queueing effects, you also need to look more broadly 
> at other load levels and system settings.

Let me try another way: The delay and rate parameters create a 
two-dimensional configuration space, but you cherry-pick only setups 
that result in near-100% load, which makes "NAPI threaded" look 
particularly good. It would be easy to provide a more comprehensive 
evaluation.
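
To make "comprehensive" concrete, here is the kind of sweep I have in
mind. The per-packet networking cost below is a number I made up purely
for illustration (it would have to be measured), and the extra
rate/delay points are just examples; the point is that the published
configurations all sit near the 100% diagonal of this grid, while the
interior is left unexplored.

/* Sketch: nominal single-core load for a (rate, delay) grid, assuming a
 * hypothetical fixed per-packet networking cost. Values >= 100% imply
 * overload and therefore queueing. */
#include <stdio.h>

int main(void)
{
	const double per_pkt_us = 2.0;                 /* assumed, not measured */
	const int rates_kpps[] = { 12, 25, 32, 64, 125 };
	const int delays_us[]  = { 0, 6, 15, 30, 38, 78 };

	printf("%8s", "Kpkt/s");
	for (size_t d = 0; d < sizeof(delays_us) / sizeof(delays_us[0]); d++)
		printf(" %5dus", delays_us[d]);
	printf("\n");

	for (size_t r = 0; r < sizeof(rates_kpps) / sizeof(rates_kpps[0]); r++) {
		printf("%8d", rates_kpps[r]);
		for (size_t d = 0; d < sizeof(delays_us) / sizeof(delays_us[0]); d++) {
			double load = rates_kpps[r] * 1000.0 *
				      (delays_us[d] + per_pkt_us) / 1e6;
			printf(" %6.0f%%", 100.0 * load);
		}
		printf("\n");
	}
	return 0;
}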

And if there's a good reason to avoid using multiple NIC queues, it 
would be good to know that as well.

As I mentioned before, I am not debating that "NAPI threaded" provides 
some performance improvements. I am just asking you to present the full 
picture.

Thanks,
Martin

