Message-ID: <CAAywjhT-CEvHUSpjyXQkpkGRN0iX7qMc1p=QOn-iCNQxdycr2A@mail.gmail.com>
Date: Mon, 25 Aug 2025 10:20:10 -0700
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Martin Karsten <mkarsten@...terloo.ca>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com,
willemb@...gle.com, Joe Damato <joe@...a.to>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll
On Sun, Aug 24, 2025 at 5:03 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>
> On 2025-08-24 17:54, Samiullah Khawaja wrote:
> > Extend the existing threaded napi poll support to do continuous busy
> > polling.
> >
> > This is used to continuously poll a napi and fetch descriptors from the
> > backing RX/TX queues for low latency applications. Allow enabling of
> > threaded busypoll using netlink so it can be turned on for a set of
> > dedicated napis.
> >
> > Once enabled, the user can fetch the PID of the kthread doing the NAPI
> > polling and set its affinity, priority and scheduler according to the
> > low-latency requirements.
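> >
> > For example (a minimal sketch only; the setup script below does the
> > same, the kthread name format napi/<dev>-<napi-id> is that of the
> > existing threaded napi support, and eth0, CPU 2 and priority 50 are
> > illustrative):
> >
> > ```
> > # sketch: assumes a single napi kthread for eth0
> > NAPI_T=$(ps -e -o pid,comm | awk '/napi\/eth0/ { print $1 }')
> > sudo chrt -f -p 50 $NAPI_T    # SCHED_FIFO, priority 50
> > sudo taskset -pc 2 $NAPI_T    # pin to CPU 2
> > ```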
> >
> > Currently threaded napi can only be enabled at the device level using
> > sysfs. Add support to enable/disable threaded mode for an individual
> > napi through the netlink interface, by extending the `napi-set` op in
> > the netlink spec with a `threaded` attribute.
> >
> > Extend the threaded attribute in the napi struct with an option to
> > enable continuous busy polling. Extend the netlink and sysfs interfaces
> > to allow enabling/disabling threaded busypolling at the device or
> > individual napi level.
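> >
> > As a minimal sketch of the two interfaces (the napi id 345 and the
> > attribute value are illustrative; the exact netlink attribute naming
> > and values depend on the final spec, and the value 2 simply mirrors
> > the sysfs convention used in the script below):
> >
> > ```
> > # per-napi, via the netdev netlink family using the ynl CLI
> > ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
> >     --do napi-set --json '{"id": 345, "threaded": 2}'
> >
> > # whole device, via sysfs
> > echo 2 | sudo tee /sys/class/net/eth0/threaded
> > ```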
> >
> > We use this for our AF_XDP based hard low-latency usecase with latency
> > requirements at the usecs level. For this usecase we want low jitter and
> > stable latency at P99.
> >
> > Following is an analysis and comparison of the available (and
> > compatible) busy poll interfaces for a low latency usecase with stable
> > P99. Please note that throughput and cpu efficiency are non-goals.
> >
> > For the analysis we use an AF_XDP based benchmarking tool, `xsk_rr`. A
> > description of the tool and how it tries to simulate the real workload
> > follows:
> >
> > - It sends UDP packets between 2 machines.
> > - The client machine sends packets at a fixed frequency. To maintain
> > this frequency independently of the replies, we use open-loop sampling,
> > that is, the packets are sent from a separate thread.
> > - The server replies to each packet inline, reading it from the RX ring
> > and sending the reply via the TX ring.
> > - To simulate the application processing time, we use a configurable
> > delay in usecs on the client side after a reply is received from the
> > server.
> >
> > The xsk_rr tool is posted separately as an RFC for tools/testing/selftests.
> >
> > We use this tool with the following napi polling configurations:
> >
> > - Interrupts only
> > - SO_BUSYPOLL (inline in the same thread where the client receives the
> > packet).
> > - SO_BUSYPOLL (separate thread and separate core)
This one uses a separate thread and core for polling the napi.
> > - Threaded NAPI busypoll
> >
> > The system is configured using the following script in all 4 cases:
> >
> > ```
> > echo 0 | sudo tee /sys/class/net/eth0/threaded
> > echo 0 | sudo tee /proc/sys/kernel/timer_migration
> > echo off | sudo tee /sys/devices/system/cpu/smt/control
> >
> > sudo ethtool -L eth0 rx 1 tx 1
> > sudo ethtool -G eth0 rx 1024
> >
> > echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> > echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> >
> > # pin IRQs on CPU 2
> > IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> > print arr[0]}' < /proc/interrupts)"
> > for irq in ${IRQS}; \
> > do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >
> > echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> >
> > for i in /sys/devices/virtual/workqueue/*/cpumask; \
> > do echo $i; echo 1,2,3,4,5,6 > $i; done
> >
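> > # default (no argument): SO_BUSYPOLL case, enable socket busy reads and defer hard IRQs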
> > if [[ -z "$1" ]]; then
> > echo 400 | sudo tee /proc/sys/net/core/busy_read
> > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > fi
> >
> > sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
> >
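> > # enable_threaded: disable socket busy polling, enable threaded napi busy poll (2)
> > # and prioritize/pin the napi kthread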
> > if [[ "$1" == "enable_threaded" ]]; then
> > echo 0 | sudo tee /proc/sys/net/core/busy_poll
> > echo 0 | sudo tee /proc/sys/net/core/busy_read
> > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > echo 2 | sudo tee /sys/class/net/eth0/threaded
> > NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> > sudo chrt -f -p 50 $NAPI_T
> >
> > # pin threaded poll thread to CPU 2
> > sudo taskset -pc 2 $NAPI_T
> > fi
> >
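> > # enable_interrupt: interrupts only, no socket busy reads and no hard IRQ deferral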
> > if [[ "$1" == "enable_interrupt" ]]; then
> > echo 0 | sudo tee /proc/sys/net/core/busy_read
> > echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > fi
> > ```
> >
> > To enable the various configurations, the script can be run as follows:
> >
> > - Interrupt Only
> > ```
> > <script> enable_interrupt
> > ```
> >
> > - SO_BUSYPOLL (no arguments to script)
> > ```
> > <script>
> > ```
> >
> > - NAPI threaded busypoll
> > ```
> > <script> enable_threaded
> > ```
> >
> > If using idpf, the script needs to be run again after launching the
> > workload to make sure the configuration is not reverted, as idpf reverts
> > some settings on the software reset that happens when an AF_XDP program
> > is attached.
> >
> > Once configured, the workload is run with the various configurations
> > using the following commands. Set the period (1/frequency) and delay in
> > usecs to produce results for a given packet frequency and application
> > processing delay.
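> > For example, the 12 Kpkt/s rows in the table below correspond to a
> > period of 1,000,000 / 12,000 ≈ 83 usecs, and the 125 Kpkt/s row to a
> > period of 8 usecs.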
> >
> > ## Interrupt Only and SO_BUSY_POLL (inline)
> >
> > - Server
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> > ```
> >
> > - Client
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v
> > ```
> >
> > ## SO_BUSY_POLL(done in separate core using recvfrom)
> >
> > Argument -t spawns a separate thread and continuously calls recvfrom.
> >
> > - Server
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> > -h -v -t
> > ```
> >
> > - Client
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t
> > ```
> >
> > ## NAPI Threaded Busy Poll
> >
> > Argument -n skips the recvfrom call as there is no recv kick needed.
> >
> > - Server
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> > -h -v -n
> > ```
> >
> > - Client
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n
> > ```
> >
> > | Experiment (latencies in ns) | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> > |---|---|---|---|---|
> > | 12 Kpkt/s + 0us delay | | | | |
> > | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> > | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> > | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> > | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> > | 32 Kpkt/s + 30us delay | | | | |
> > | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> > | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> > | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> > | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> > | 125 Kpkt/s + 6us delay | | | | |
> > | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> > | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> > | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> > | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> > | 12 Kpkt/s + 78us delay | | | | |
> > | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> > | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> > | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> > | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> > | 25 Kpkt/s + 38us delay | | | | |
> > | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> > | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> > | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> > | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >
> > ## Observations
>
> Hi Samiullah,
>
Thanks for the review.
> I believe you are comparing apples and oranges with these experiments.
> Because threaded busy poll uses two cores at each end (at 100%), you
The SO_BUSYPOLL(separate) column is actually running in a separate
thread and using two cores, so this is comparing apples to apples.
> should compare with 2 pairs of xsk_rr processes using interrupt mode,
> but each running at half the rate. I am quite certain you would then see
> the same latency as in the baseline experiment - at much reduced cpu
> utilization.
>
> Threaded busy poll reduces p99 latency by just 100 nsec, while
The table in the experiments shows much larger differences in latency.
> busy-spinning two cores, at each end - not more not less. I continue to
> believe that this trade-off and these limited benefits need to be
> clearly and explicitly spelled out in the cover letter.
Yes, if you just look at the first row of the table then there is
virtually no difference.
>
> Best,
> Martin
>