Message-ID: <willemdebruijn.kernel.338f6ae18246c@gmail.com>
Date: Fri, 29 Aug 2025 14:42:43 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Martin Karsten <mkarsten@...terloo.ca>, 
 Samiullah Khawaja <skhawaja@...gle.com>
Cc: Jakub Kicinski <kuba@...nel.org>, 
 "David S . Miller" <davem@...emloft.net>, 
 Eric Dumazet <edumazet@...gle.com>, 
 Paolo Abeni <pabeni@...hat.com>, 
 almasrymina@...gle.com, 
 willemb@...gle.com, 
 Joe Damato <joe@...a.to>, 
 netdev@...r.kernel.org
Subject: Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll

Martin Karsten wrote:
> On 2025-08-29 13:50, Samiullah Khawaja wrote:
> > On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
> >>
> >> On 2025-08-28 21:16, Samiullah Khawaja wrote:
> >>> Extend the existing threaded napi poll support to do continuous busy
> >>> polling.
> >>>
> >>> This is used to continuously poll a napi and fetch descriptors from its
> >>> backing RX/TX queues for low latency applications. Allow enabling of
> >>> threaded busy poll through netlink so it can be turned on for a set of
> >>> dedicated napis.
> >>>
> >>> Once enabled, the user can fetch the PID of the kthread doing the NAPI
> >>> polling and set its affinity, priority and scheduling policy depending
> >>> on the low-latency requirements.
> >>>
> >>> Extend the netlink interface to allow enabling/disabling threaded
> >>> busy polling at the individual napi level.
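> >>>
> >>> As an illustrative sketch of the intended workflow (the napi id, the
> >>> kthread name and the exact netlink attribute value shown here are
> >>> examples and may differ), threaded busy poll can be enabled on a single
> >>> napi through the netdev netlink family and the resulting kthread pinned
> >>> afterwards:
> >>>
> >>> ```
> >>> # enable threaded busy poll on napi id 345 via netdev netlink
> >>> # (ynl path and the threaded attribute value may differ per tree)
> >>> ./tools/net/ynl/pyynl/cli.py \
> >>>        --spec Documentation/netlink/specs/netdev.yaml \
> >>>        --do napi-set --json '{"id": 345, "threaded": "busy-poll"}'
> >>>
> >>> # the poll kthread is named napi/<dev>-<napi-id>; prioritize and pin it
> >>> NAPI_T=$(pgrep -f 'napi/eth0-345')
> >>> sudo chrt -f -p 50 $NAPI_T
> >>> sudo taskset -pc 2 $NAPI_T
> >>> ```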
> >>>
> >>> We use this for our AF_XDP based hard low-latency use case with
> >>> microsecond-level latency requirements. For this use case we want low
> >>> jitter and stable latency at P99.
> >>>
> >>> The following is an analysis and comparison of the available (and
> >>> compatible) busy poll interfaces for a low latency use case with stable
> >>> P99. This can be suitable for applications that want very low latency at
> >>> the expense of CPU usage and efficiency.
> >>>
> >>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
> >>> backing a socket, but the missing piece is a mechanism to busy poll a
> >>> NAPI instance in a dedicated thread while ignoring available events or
> >>> packets, regardless of the userspace API. Most existing mechanisms are
> >>> designed to work in a pattern where you poll until new packets or events
> >>> are received, after which userspace is expected to handle them.
> >>>
> >>> As a result, one has to hack together a solution using a mechanism
> >>> intended to receive packets or events, not to simply NAPI poll. NAPI
> >>> threaded busy polling, on the other hand, provides this capability
> >>> natively, independent of any userspace API. This makes it really easy to
> >>> set up and manage.
> >>>
> >>> For the analysis we use an AF_XDP based benchmarking tool, `xsk_rr`. The
> >>> tool and how it tries to simulate the real workload are described below:
> >>>
> >>> - It sends UDP packets between 2 machines.
> >>> - The client machine sends packets at a fixed frequency. To maintain the
> >>>     send frequency, we use open-loop sampling. That is, the packets are
> >>>     sent from a separate thread.
> >>> - The server replies to each packet inline, reading the packet from the
> >>>     RX ring and replying on the TX ring.
> >>> - To simulate the application processing time, we use a configurable
> >>>     delay in usecs on the client side after a reply is received from the
> >>>     server.
> >>>
> >>> The xsk_rr tool is posted separately as an RFC for tools/testing/selftests.
> >>>
> >>> We use this tool with the following napi polling configurations:
> >>>
> >>> - Interrupts only
> >>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> >>>     packet).
> >>> - SO_BUSYPOLL (separate thread and separate core)
> >>> - Threaded NAPI busypoll
> >>>
> >>> The system is configured using the following script in all 4 cases:
> >>>
> >>> ```
> >>> echo 0 | sudo tee /sys/class/net/eth0/threaded
> >>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
> >>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
> >>>
> >>> sudo ethtool -L eth0 rx 1 tx 1
> >>> sudo ethtool -G eth0 rx 1024
> >>>
> >>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> >>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> >>>
> >>>    # pin IRQs on CPU 2
> >>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> >>>                                print arr[0]}' < /proc/interrupts)"
> >>> for irq in ${IRQS}; \
> >>>        do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >>>
> >>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> >>>
> >>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
> >>>                        do echo $i; echo 1,2,3,4,5,6 | sudo tee $i; done
> >>>
> >>> if [[ -z "$1" ]]; then
> >>>     echo 400 | sudo tee /proc/sys/net/core/busy_read
> >>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>     echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>> fi
> >>>
> >>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
> >>>
> >>> if [[ "$1" == "enable_threaded" ]]; then
> >>>     echo 0 | sudo tee /proc/sys/net/core/busy_poll
> >>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>     echo 2 | sudo tee /sys/class/net/eth0/threaded
> >>>     NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> >>>     sudo chrt -f  -p 50 $NAPI_T
> >>>
> >>>     # pin threaded poll thread to CPU 2
> >>>     sudo taskset -pc 2 $NAPI_T
> >>> fi
> >>>
> >>> if [[ "$1" == "enable_interrupt" ]]; then
> >>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>     echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>> fi
> >>> ```
> >>
> >> The experiment script above does not work, because the sysfs parameter
> >> does not exist anymore in this version.
> >>
> >>> To enable the various configurations, the script can be run as follows:
> >>>
> >>> - Interrupt Only
> >>>     ```
> >>>     <script> enable_interrupt
> >>>     ```
> >>>
> >>> - SO_BUSYPOLL (no arguments to script)
> >>>     ```
> >>>     <script>
> >>>     ```
> >>>
> >>> - NAPI threaded busypoll
> >>>     ```
> >>>     <script> enable_threaded
> >>>     ```
> >>>
> >>> If using idpf, the script needs to be run again after launching the
> >>> workload to make sure that the configurations are not reverted, since
> >>> idpf reverts some configurations on a software reset when the AF_XDP
> >>> program is attached.
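> >>>
> >>> For example (ordering sketch only; the binary paths and arguments are
> >>> the ones from the run commands below):
> >>>
> >>> ```
> >>> # start the workload first so idpf's AF_XDP-triggered reset happens now
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr ... &
> >>> sleep 2
> >>> # then re-apply the configuration that the reset reverted
> >>> <script> enable_threaded
> >>> ```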
> >>>
> >>> Once configured, the workload is run in the various configurations using
> >>> the following commands. Set the period (1/frequency) and delay in usecs
> >>> to produce results for a given packet frequency and application
> >>> processing delay.
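> >>>
> >>> As a small worked example (the numbers mirror the first experiment in
> >>> the results table; the variable names are illustrative):
> >>>
> >>> ```
> >>> # 12 Kpkt/s -> period = 1,000,000 / 12,000 ~= 83 usecs
> >>> RATE_PPS=12000
> >>> PERIOD_USECS=$(( 1000000 / RATE_PPS ))   # 83
> >>> # pass as -P $PERIOD_USECS -d <Delay-usecs> to the client commands below
> >>> ```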
> >>>
> >>>    ## Interrupt Only and SO_BUSYPOLL (inline)
> >>>
> >>> - Server
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> >>> ```
> >>>
> >>> - Client
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
> >>> ```
> >>>
> >>>    ## SO_BUSYPOLL(done in separate core using recvfrom)
> >>>
> >>> Argument -t spawns a separate thread that continuously calls recvfrom.
> >>>
> >>> - Server
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>        -h -v -t
> >>> ```
> >>>
> >>> - Client
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
> >>> ```
> >>>
> >>>    ## NAPI Threaded Busy Poll
> >>>
> >>> Argument -n skips the recvfrom call as there is no recv kick needed.
> >>>
> >>> - Server
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>        -h -v -n
> >>> ```
> >>>
> >>> - Client
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
> >>> ```
> >>
> >> I believe there's a bug when disabling busy-polled napi threading after
> >> an experiment. My system hangs and needs a hard reset.
> >>
> >>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> >>> |---|---|---|---|---|
> >>> | 12 Kpkt/s + 0us delay | | | | |
> >>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >>> | 32 Kpkt/s + 30us delay | | | | |
> >>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >>> | 125 Kpkt/s + 6us delay | | | | |
> >>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >>> | 12 Kpkt/s + 78us delay | | | | |
> >>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >>> | 25 Kpkt/s + 38us delay | | | | |
> >>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>
> >> On my system, routing the irq to the same core where xsk_rr runs results
> >> in lower latency than routing the irq to a different core. To me that
> >> makes sense in a low-rate latency-sensitive scenario where interrupts are
> >> not causing much trouble, while the resulting locality might be
> >> beneficial. I think you should test this as well.
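> >>
> >> Concretely, that variant only changes the IRQ affinity step of the setup
> >> script (the core number here simply matches the taskset range used for
> >> xsk_rr in the run commands):
> >>
> >> ```
> >> # route the NIC irq to one of the cores xsk_rr is pinned to (3-5),
> >> # reusing the IRQS variable computed in the setup script above
> >> for irq in ${IRQS}; \
> >>        do echo 3 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >> ```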
> >>
> >> The experiments reported above (except for the first one) are
> >> cherry-picking parameter combinations that result in a near-100% load
> >> and ignore anything else. Near-100% load is a highly unlikely scenario
> >> for a latency-sensitive workload.
> >>
> >> When combining the above two paragraphs, I believe other interesting
> >> setups are missing from the experiments, such as comparing to two pairs
> >> of xsk_rr under high load (as mentioned in my previous emails).
> > This is to support an existing real workload. We cannot easily modify
> > its threading model. The two xsk_rr model would be a different
> > workload.
> 
> That's fine, but:
> 
> - In principle I don't think it's a good justification for a kernel 
> change that an application cannot be rewritten.

It's not as narrow as one application. It's a way to scale processing
using pipelining instead of sharding. Both are reasonable approaches.

Especially for XDP, doing this first stage in the kernel makes sense
to me, as it makes XDP closer to hardware descriptor queue based
polling architectures such as DPDK or Google's SPIN. The OS abstracts
away the hardware format and the format translation entirely.

> - I believe it is your responsibility to more comprehensively document 
> the impact of your proposed changes beyond your one particular workload.
>
> Also, I do believe there's a bug as mentioned before. I can't quite pin 
> it down, but every time after running a "NAPI threaded" experiment, my 
> server enters a funny state and eventually becomes largely unresponsive
> without much useful output and needs a hard reset. For example:
> 
> 1) Run "NAPI threaded" experiment
> 2) Disable the "threaded" parameter in the NAPI config
> 3) Run IRQ experiment -> xsk_rr hangs and apparently holds a lock, 
> because other services stop working successively.
> 
> Do you not have this problem?
> 
> Thanks,
> Martin
> 


