Message-ID: <a5166a4a-0c9d-420b-a06e-e4754a555a7e@uwaterloo.ca>
Date: Fri, 29 Aug 2025 20:40:55 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, willemb@...gle.com,
 Joe Damato <joe@...a.to>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll

On 2025-08-29 20:21, Samiullah Khawaja wrote:
> On Fri, Aug 29, 2025 at 4:37 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>
>> On 2025-08-29 19:31, Samiullah Khawaja wrote:
>>> On Fri, Aug 29, 2025 at 3:56 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>>>
>>>> On 2025-08-29 18:25, Samiullah Khawaja wrote:
>>>>> On Fri, Aug 29, 2025 at 3:19 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>>>>>
>>>>>> On 2025-08-29 14:08, Martin Karsten wrote:
>>>>>>> On 2025-08-29 13:50, Samiullah Khawaja wrote:
>>>>>>>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@...terloo.ca>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
>>>>>>>>>> Extend the existing support for threaded NAPI poll to do continuous
>>>>>>>>>> busy polling.
>>>>>>>>>>
>>>>>>>>>> This is used for continuous polling of a NAPI to fetch descriptors from
>>>>>>>>>> the backing RX/TX queues for low-latency applications. Allow enabling
>>>>>>>>>> threaded busy poll via netlink so that it can be turned on for a set of
>>>>>>>>>> NAPIs dedicated to low-latency applications.
>>>>>>>>>>
>>>>>>>>>> Once enabled, the user can fetch the PID of the kthread doing the NAPI
>>>>>>>>>> polling and set its affinity, priority and scheduling policy depending on
>>>>>>>>>> the low-latency requirements.
>>>>>>>>>>
>>>>>>>>>> Extend the netlink interface to allow enabling/disabling threaded busy
>>>>>>>>>> polling at the individual NAPI level.
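>>>>>>>>>>
>>>>>>>>>> For illustration, a NAPI instance could be switched into threaded busy
>>>>>>>>>> polling with the ynl CLI roughly as below. This is only a sketch: the
>>>>>>>>>> attribute name and value ("threaded": "busy-poll") and the ifindex/NAPI
>>>>>>>>>> ids are assumptions based on this series, and the exact netlink spelling
>>>>>>>>>> may differ.
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> # Dump the NAPI instances of the device (ifindex 2 here) to find the id.
>>>>>>>>>> ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>>>>>>>>>>         --dump napi-get --json '{"ifindex": 2}'
>>>>>>>>>>
>>>>>>>>>> # Enable threaded busy polling on one NAPI (id 345 is illustrative).
>>>>>>>>>> ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>>>>>>>>>>         --do napi-set --json '{"id": 345, "threaded": "busy-poll"}'
>>>>>>>>>> ```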
>>>>>>>>>>
>>>>>>>>>> We use this for our AF_XDP based hard low-latency use case with
>>>>>>>>>> usec-level latency requirements. For our use case we want low jitter and
>>>>>>>>>> stable latency at P99.
>>>>>>>>>>
>>>>>>>>>> The following is an analysis and comparison of the available (and
>>>>>>>>>> compatible) busy poll interfaces for a low-latency use case with a stable
>>>>>>>>>> P99. This can be suitable for applications that want very low latency at
>>>>>>>>>> the expense of CPU usage and efficiency.
>>>>>>>>>>
>>>>>>>>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
>>>>>>>>>> backing a socket, but the missing piece is a mechanism to busy poll a
>>>>>>>>>> NAPI instance in a dedicated thread while ignoring available events or
>>>>>>>>>> packets, regardless of the userspace API. Most existing mechanisms are
>>>>>>>>>> designed to work in a pattern where you poll until new packets or events
>>>>>>>>>> are received, after which userspace is expected to handle them.
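>>>>>>>>>>
>>>>>>>>>> For reference, a rough sketch of the socket-scoped knobs behind that
>>>>>>>>>> pattern (the values below are purely illustrative):
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> # busy_read: usecs to busy poll inside blocking receive calls.
>>>>>>>>>> # busy_poll: usecs to busy poll inside poll()/select()/epoll_wait().
>>>>>>>>>> # Either way, polling stops once packets/events are handed to userspace.
>>>>>>>>>> echo 50 | sudo tee /proc/sys/net/core/busy_read
>>>>>>>>>> echo 50 | sudo tee /proc/sys/net/core/busy_poll
>>>>>>>>>> # A per-socket budget can also be set via setsockopt(SO_BUSY_POLL).
>>>>>>>>>> ```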
>>>>>>>>>>
>>>>>>>>>> As a result, one has to hack together a solution using a mechanism
>>>>>>>>>> intended to receive packets or events, not to simply poll the NAPI. NAPI
>>>>>>>>>> threaded busy polling, on the other hand, provides this capability
>>>>>>>>>> natively, independent of any userspace API. This makes it really easy to
>>>>>>>>>> set up and manage.
>>>>>>>>>>
>>>>>>>>>> For the analysis we use an AF_XDP based benchmarking tool, `xsk_rr`. The
>>>>>>>>>> tool and how it tries to simulate the real workload are described below:
>>>>>>>>>>
>>>>>>>>>> - It sends UDP packets between 2 machines.
>>>>>>>>>> - The client machine sends packets at a fixed frequency. To maintain the
>>>>>>>>>>   frequency of the packets being sent, we use open-loop sampling; that
>>>>>>>>>>   is, the packets are sent from a separate thread.
>>>>>>>>>> - The server replies to each packet inline by reading it from the recv
>>>>>>>>>>   ring and sending the reply via the tx ring.
>>>>>>>>>> - To simulate the application processing time, we use a configurable
>>>>>>>>>>   delay in usecs on the client side after a reply is received from the
>>>>>>>>>>   server.
>>>>>>>>>>
>>>>>>>>>> The xsk_rr tool is posted separately as an RFC for tools/testing/selftests.
>>>>>>>>>>
>>>>>>>>>> We use this tool with the following NAPI polling configurations:
>>>>>>>>>>
>>>>>>>>>> - Interrupts only
>>>>>>>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>>>>>>>>   packet)
>>>>>>>>>> - SO_BUSYPOLL (separate thread and separate core)
>>>>>>>>>> - Threaded NAPI busy poll
>>>>>>>>>>
>>>>>>>>>> The system is configured using the following script in all 4 cases:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
>>>>>>>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
>>>>>>>>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
>>>>>>>>>>
>>>>>>>>>> sudo ethtool -L eth0 rx 1 tx 1
>>>>>>>>>> sudo ethtool -G eth0 rx 1024
>>>>>>>>>>
>>>>>>>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
>>>>>>>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>>>>>>>>>>
>>>>>>>>>> # pin IRQs on CPU 2
>>>>>>>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
>>>>>>>>>>                                 print arr[0]}' < /proc/interrupts)"
>>>>>>>>>> # left unquoted so the loop iterates over each matching IRQ
>>>>>>>>>> for irq in ${IRQS}; \
>>>>>>>>>>           do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>>>>>>>>>
>>>>>>>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>>>>>>>>>>
>>>>>>>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
>>>>>>>>>>           do echo $i; echo 1,2,3,4,5,6 | sudo tee $i; done
>>>>>>>>>>
>>>>>>>>>> if [[ -z "$1" ]]; then
>>>>>>>>>>        echo 400 | sudo tee /proc/sys/net/core/busy_read
>>>>>>>>>>        echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>>>>>        echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>>>>>> fi
>>>>>>>>>>
>>>>>>>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
>>>>>>>>>>
>>>>>>>>>> if [[ "$1" == "enable_threaded" ]]; then
>>>>>>>>>>        echo 0 | sudo tee /proc/sys/net/core/busy_poll
>>>>>>>>>>        echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>>>>>>        echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>>>>>        echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>>>>>>        echo 2 | sudo tee /sys/class/net/eth0/threaded
>>>>>>>>>>        NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>>>>>>>>>>        sudo chrt -f  -p 50 $NAPI_T
>>>>>>>>>>
>>>>>>>>>>        # pin threaded poll thread to CPU 2
>>>>>>>>>>        sudo taskset -pc 2 $NAPI_T
>>>>>>>>>> fi
>>>>>>>>>>
>>>>>>>>>> if [[ "$1" == "enable_interrupt" ]]; then
>>>>>>>>>>        echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>>>>>>        echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>>>>>        echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>>>>>> fi
>>>>>>>>>> ```
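>>>>>>>>>>
>>>>>>>>>> A quick sanity check of the threaded-poll setup (not part of the script
>>>>>>>>>> above) is to confirm the NAPI kthread's CPU and RT priority, e.g.:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> # psr = CPU the thread runs on, rtprio = real-time priority;
>>>>>>>>>> # with the enable_threaded settings above, expect CPU 2 and rtprio 50.
>>>>>>>>>> ps -eLo pid,psr,rtprio,comm | grep napi
>>>>>>>>>> ```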
>>>>>>>>>
>>>>>>>>> The experiment script above does not work, because the sysfs parameter
>>>>>>>>> does not exist anymore in this version.
>>>>>>>>>
>>>>>>>>>> To enable the various configurations, the script can be run as follows:
>>>>>>>>>>
>>>>>>>>>> - Interrupt Only
>>>>>>>>>>        ```
>>>>>>>>>>        <script> enable_interrupt
>>>>>>>>>>        ```
>>>>>>>>>>
>>>>>>>>>> - SO_BUSYPOLL (no arguments to script)
>>>>>>>>>>        ```
>>>>>>>>>>        <script>
>>>>>>>>>>        ```
>>>>>>>>>>
>>>>>>>>>> - NAPI threaded busypoll
>>>>>>>>>>        ```
>>>>>>>>>>        <script> enable_threaded
>>>>>>>>>>        ```
>>>>>>>>>>
>>>>>>>>>> If using idpf, the script needs to be run again after launching the
>>>>>>>>>> workload just to make sure that the configurations are not reverted, as
>>>>>>>>>> idpf reverts some configurations on a software reset when an AF_XDP
>>>>>>>>>> program is attached.
>>>>>>>>>>
>>>>>>>>>> Once configured, the workload is run with various configurations using
>>>>>>>>>> the following commands. Set the period (1/frequency) and delay in usecs
>>>>>>>>>> to produce results for a given packet frequency and application
>>>>>>>>>> processing delay.
>>>>>>>>>>
>>>>>>>>>>       ## Interrupt Only and SO_BUSYPOLL (inline)
>>>>>>>>>>
>>>>>>>>>> - Server
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>>>>>>           -h -v
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> - Client
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>>>>>           -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>>       ## SO_BUSYPOLL (done on a separate core using recvfrom)
>>> Defines this test case clearly here.
>>>>>>>>>>
>>>>>>>>>> Argument -t spawns a separate thread that continuously calls recvfrom.
>>> This defines the -t argument and clearly states that it spawns the
>>> separate thread.
>>>>>>>>>>
>>>>>>>>>> - Server
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>>>>>>           -h -v -t
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> - Client
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>>>>>           -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
>>>>>>>>>> ```
>>>>
>>>> see below
>>>>>>>>>>       ## NAPI Threaded Busy Poll
>>> Section for NAPI Threaded Busy Poll scenario
>>>>>>>>>>
>>>>>>>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
>>> States -n argument and defines it.
>>>>>>>>>>
>>>>>>>>>> - Server
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>>>>>>           -h -v -n
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> - Client
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>>>>>           -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
>>>>>>>>>> ```
>>>>
>>>> see below
>>>>>>>>> I believe there's a bug when disabling busy-polled napi threading after
>>>>>>>>> an experiment. My system hangs and needs a hard reset.
>>>>>>>>>
>>>>>>>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
>>>>>>>>>> |---|---|---|---|---|
>>>>>>>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>>>>>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>>>>>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>>>>>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>>>>>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>>>>>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>>>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>>>>>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>>>>>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>>>>>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>>>>>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>>>>>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>>>>>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>>>>>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>>>>>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>>>>>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>>>>>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>>>>>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>>>>>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>>>>>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>>>>>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>>>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>>>>>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>>>>>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>>>>>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>>>>>>>
>>>>>>>>> On my system, routing the irq to the same core where xsk_rr runs results
>>>>>>>>> in lower latency than routing the irq to a different core. To me that
>>>>>>>>> makes sense in a low-rate latency-sensitive scenario where interrupts are
>>>>>>>>> not causing much trouble, while the resulting locality might be
>>>>>>>>> beneficial. I think you should test this as well.
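>>>>>>>>>
>>>>>>>>> For example, a minimal variant of the setup (reusing $IRQS from the script
>>>>>>>>> above, and assuming the xsk_rr receive thread ends up on CPU 3) would be:
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> # Route the NIC irq to the core that runs xsk_rr instead of a separate core.
>>>>>>>>> for irq in ${IRQS}; \
>>>>>>>>>         do echo 3 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>>>>>>>>
>>>>>>>>> # Pin xsk_rr to that same core (same arguments as in the commands above).
>>>>>>>>> sudo chrt -f 50 taskset -c 3 ./xsk_rr ...
>>>>>>>>> ```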
>>>>>>>>>
>>>>>>>>> The experiments reported above (except for the first one) cherry-pick
>>>>>>>>> parameter combinations that result in a near-100% load and ignore
>>>>>>>>> anything else. For example, 25 Kpkt/s corresponds to a 40 usec period, so
>>>>>>>>> a 38 usec application delay keeps the receiving core busy about 95% of
>>>>>>>>> the time. Near-100% load is a highly unlikely scenario for a
>>>>>>>>> latency-sensitive workload.
>>>>>>>>>
>>>>>>>>> When combining the above two paragraphs, I believe other interesting
>>>>>>>>> setups are missing from the experiments, such as comparing to two pairs
>>>>>>>>> of xsk_rr under high load (as mentioned in my previous emails).
>>>>>>>> This is to support an existing real workload. We cannot easily modify
>>>>>>>> its threading model. The two xsk_rr model would be a different
>>>>>>>> workload.
>>>>>>>
>>>>>>> That's fine, but:
>>>>>>>
>>>>>>> - In principle I don't think it's a good justification for a kernel
>>>>>>> change that an application cannot be rewritten.
>>>>>>>
>>>>>>> - I believe it is your responsibility to more comprehensively document
>>>>>>> the impact of your proposed changes beyond your one particular workload.
>>>>>> A few more observations from my tests for the "SO_BUSYPOLL(separate)" case:
>>>>>>
>>>>>> - Using -t for the client reduces latency compared to -T.
>>>>> That is understandable and also it is part of the data I presented. -t
>>>>> means running the SO_BUSY_POLL in a separate thread. Removing -T would
>>>>> invalidate the workload by making the rate unpredictable.
>>>>
>>>> That's another problem with your cover letter then. The experiment as
>>>> described should match the data presented. See above.
>>> The experiments are described clearly. I have pointed out the areas in
>>> the cover letter where these are documented. Where is the mismatch?
>>
>> Ah, I missed the -t at the end, sorry, my bad.
>>
>>>>>> - Using poll instead of recvfrom in xsk_rr in rx_polling_run() also
>>>>>> reduces latency.
>>>>
>>>> Any thoughts on this one?
>>> I think we discussed this already in the previous iteration, with
>>> Stanislav, and how it will suffer the same way SO_BUSYPOLL suffers. As
>>> I have already stated, for my workload every microsecond matters and
>>> the CPU efficiency is not an issue.
>>
>> Discussing is one thing. Testing is another. In my setup I observe a
>> noticeable difference between using recvfrom and poll.
> I experimented with it and it seems to improve things a little bit in some
> cases (maybe 200 nsecs) but performs really badly with a low packet
> rate, as expected. As discussed in the last iteration and also in the
> cover letter, this is because it only polls when there are no events.
> 
> count: 1249 p5: 17200
> count: 12436 p50: 21100
> count: 21106 p95: 21700
> count: 21994 p99: 21700
> rate: 24995
> outstanding packets: 5
> 
> As I stated in the cover letter and documentation, one can try to
> hack together something using APIs designed to receive packets or events,
> but it's better to have a native mechanism supported by the OS that is
> designed to poll the underlying NAPIs, if that is what the user wants.
I see, thanks for checking. I got sidetracked and was looking at yet 
another setup (0 period, 0 delay). The timing of xsk_rr with "work" and 
I/O being interleaved seems like a special case (not to mention the 100% 
load). Anyway, I am sure you will post again and I will make my 
statement about a comprehensive evaluation again. ;-)

Best,
Martin

