Message-ID: <0d66c174-32d1-435d-9b1a-5672201dd2e0@uwaterloo.ca>
Date: Fri, 29 Aug 2025 17:27:21 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, willemb@...gle.com,
 Joe Damato <joe@...a.to>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll

On 2025-08-29 16:49, Samiullah Khawaja wrote:
> On Fri, Aug 29, 2025 at 11:08 AM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>
>> On 2025-08-29 13:50, Samiullah Khawaja wrote:
>>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>>>
>>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
>>>>> Extend the existing threaded NAPI poll support to do continuous
>>>>> busy polling.
>>>>>
>>>>> This is used to continuously poll a NAPI instance and fetch descriptors
>>>>> from the backing RX/TX queues for low-latency applications. Allow
>>>>> enabling threaded busy polling via netlink so that it can be turned on
>>>>> for a set of dedicated NAPIs.
>>>>>
>>>>> Once enabled, the user can fetch the PID of the kthread doing the NAPI
>>>>> polling and set its affinity, priority and scheduler according to the
>>>>> low-latency requirements.
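>>>>>
>>>>> As a rough illustration (mirroring the setup script further below, so
>>>>> the thread lookup, priority and CPU are just examples), pinning and
>>>>> prioritizing the polling kthread could look like:
>>>>>
>>>>> ```
>>>>> # find the NAPI polling kthread and give it an RT priority and a CPU
>>>>> NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>>>>> sudo chrt -f -p 50 $NAPI_T    # SCHED_FIFO, priority 50
>>>>> sudo taskset -pc 2 $NAPI_T    # pin to CPU 2
>>>>> ```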
>>>>>
>>>>> Extend the netlink interface to allow enabling/disabling threaded
>>>>> busy polling at the individual NAPI level.
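>>>>>
>>>>> For illustration, enabling and later disabling this on one NAPI via the
>>>>> ynl cli could look as follows (the napi id and spec path are only
>>>>> examples and are system-specific):
>>>>>
>>>>> ```
>>>>> # enable threaded busy polling on a single napi
>>>>> ./tools/net/ynl/pyynl/cli.py \
>>>>>     --spec Documentation/netlink/specs/netdev.yaml \
>>>>>     --do napi-set --json='{"id": 8209, "threaded": "busy-poll-enabled"}'
>>>>>
>>>>> # disable it again
>>>>> ./tools/net/ynl/pyynl/cli.py \
>>>>>     --spec Documentation/netlink/specs/netdev.yaml \
>>>>>     --do napi-set --json='{"id": 8209, "threaded": "disabled"}'
>>>>> ```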
>>>>>
>>>>> We use this for our AF_XDP-based hard low-latency use case with
>>>>> microsecond-level latency requirements. For this use case we want low
>>>>> jitter and stable latency at P99.
>>>>>
>>>>> The following is an analysis and comparison of the available (and
>>>>> compatible) busy-poll interfaces for a low-latency use case with a
>>>>> stable P99. It is suitable for applications that want very low latency
>>>>> at the expense of CPU usage and efficiency.
>>>>>
>>>>> Existing APIs (SO_BUSYPOLL and epoll) already allow busy polling the
>>>>> NAPI backing a socket, but the missing piece is a mechanism to busy
>>>>> poll a NAPI instance in a dedicated thread while ignoring available
>>>>> events or packets, regardless of the userspace API. Most existing
>>>>> mechanisms are designed around a pattern where you poll until new
>>>>> packets or events are received, after which userspace is expected to
>>>>> handle them.
>>>>>
>>>>> As a result, one has to hack together a solution using a mechanism
>>>>> intended to receive packets or events, not to simply poll the NAPI.
>>>>> NAPI threaded busy polling, on the other hand, provides this capability
>>>>> natively, independent of any userspace API. This makes it easy to set
>>>>> up and manage.
>>>>>
>>>>> For the analysis we use an AF_XDP-based benchmarking tool, `xsk_rr`. A
>>>>> description of the tool and of how it tries to simulate the real
>>>>> workload follows:
>>>>>
>>>>> - It sends UDP packets between 2 machines.
>>>>> - The client machine sends packets at a fixed frequency. To maintain
>>>>>      this frequency, we use open-loop sampling; that is, the packets
>>>>>      are sent from a separate thread.
>>>>> - The server replies to each packet inline, reading it from the recv
>>>>>      ring and replying via the tx ring.
>>>>> - To simulate the application processing time, we use a configurable
>>>>>      delay in usecs on the client side after a reply is received from the
>>>>>      server.
>>>>>
>>>>> The xsk_rr tool is posted separately as an RFC for tools/testing/selftests.
>>>>>
>>>>> We use this tool with the following NAPI polling configurations:
>>>>>
>>>>> - Interrupts only
>>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>>>      packet).
>>>>> - SO_BUSYPOLL (separate thread and separate core)
>>>>> - Threaded NAPI busypoll
>>>>>
>>>>> The system is configured using the following script in all 4 cases:
>>>>>
>>>>> ```
>>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
>>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
>>>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
>>>>>
>>>>> sudo ethtool -L eth0 rx 1 tx 1
>>>>> sudo ethtool -G eth0 rx 1024
>>>>>
>>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
>>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>>>>>
>>>>>     # pin IRQs on CPU 2
>>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
>>>>>                                 print arr[0]}' < /proc/interrupts)"
>>>>> for irq in ${IRQS}; \
>>>>>         do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>>>>
>>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>>>>>
>>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
>>>>>                         do echo $i; echo 1,2,3,4,5,6 > $i; done
>>>>>
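>>>>> # default (no script argument): SO_BUSYPOLL configuration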
>>>>> if [[ -z "$1" ]]; then
>>>>>      echo 400 | sudo tee /proc/sys/net/core/busy_read
>>>>>      echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>      echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>> fi
>>>>>
>>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
>>>>>
>>>>> if [[ "$1" == "enable_threaded" ]]; then
>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_poll
>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>      echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>      echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>      echo 2 | sudo tee /sys/class/net/eth0/threaded
>>>>>      NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>>>>>      sudo chrt -f  -p 50 $NAPI_T
>>>>>
>>>>>      # pin threaded poll thread to CPU 2
>>>>>      sudo taskset -pc 2 $NAPI_T
>>>>> fi
>>>>>
>>>>> if [[ "$1" == "enable_interrupt" ]]; then
>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>      echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>      echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>> fi
>>>>> ```
>>>>
>>>> The experiment script above does not work because the sysfs parameter
>>>> no longer exists in this version.
>>>>
>>>>> To enable the various configurations, the script can be run as follows:
>>>>>
>>>>> - Interrupt Only
>>>>>      ```
>>>>>      <script> enable_interrupt
>>>>>      ```
>>>>>
>>>>> - SO_BUSYPOLL (no arguments to script)
>>>>>      ```
>>>>>      <script>
>>>>>      ```
>>>>>
>>>>> - NAPI threaded busypoll
>>>>>      ```
>>>>>      <script> enable_threaded
>>>>>      ```
>>>>>
>>>>> If using idpf, the script needs to be run again after launching the
>>>>> workload, just to make sure that the configuration is not reverted, as
>>>>> idpf reverts some settings on a software reset when an AF_XDP program
>>>>> is attached.
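>>>>>
>>>>> For example, with idpf the order would be:
>>>>>
>>>>> ```
>>>>> <script> enable_threaded      # initial configuration
>>>>> # ... start the xsk_rr server/client (attaching AF_XDP resets idpf) ...
>>>>> <script> enable_threaded      # re-apply the reverted settings
>>>>> ```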
>>>>>
>>>>> Once configured, the workload is run with the various configurations
>>>>> using the following commands. Set the period (1/frequency) and the delay
>>>>> in usecs to produce results for a given packet frequency and application
>>>>> processing delay.
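>>>>> For example, the 12 Kpkt/s experiments below correspond to a period of
>>>>> 1/12000 s, i.e. roughly 83 usecs, and the 125 Kpkt/s experiment to a
>>>>> period of 8 usecs.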
>>>>>
>>>>>     ## Interrupt Only and SO_BUSYPOLL (inline)
>>>>>
>>>>> - Server
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
>>>>> ```
>>>>>
>>>>> - Client
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
>>>>> ```
>>>>>
>>>>>     ## SO_BUSYPOLL(done in separate core using recvfrom)
>>>>>
>>>>> Argument -t spawns a separate thread that continuously calls recvfrom.
>>>>>
>>>>> - Server
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>         -h -v -t
>>>>> ```
>>>>>
>>>>> - Client
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
>>>>> ```
>>>>>
>>>>>     ## NAPI Threaded Busy Poll
>>>>>
>>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
>>>>>
>>>>> - Server
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>         -h -v -n
>>>>> ```
>>>>>
>>>>> - Client
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
>>>>> ```
>>>>
>>>> I believe there's a bug when disabling busy-polled napi threading after
>>>> an experiment. My system hangs and needs a hard reset.
>>>>
>>>>> | Experiment | Interrupts | SO_BUSYPOLL | SO_BUSYPOLL (separate) | NAPI threaded |
>>>>> |---|---|---|---|---|
>>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>>
>>>> On my system, routing the irq to the same core where xsk_rr runs results
>>>> in lower latency than routing the irq to a different core. To me that
>>>> makes sense in a low-rate latency-sensitive scenario where interrupts do
>>>> not cause much trouble, but the resulting locality might be beneficial. I
>>>> think you should test this as well.
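>>>>
>>>> (With the setup script above, that would mean pinning the IRQ to one of
>>>> the xsk_rr CPUs from `taskset -c 3-5`, e.g. `echo 3 | sudo tee
>>>> /proc/irq/$irq/smp_affinity_list`, instead of CPU 2.)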
>>>>
>>>> The experiments reported above (except for the first one) are
>>>> cherry-picking parameter combinations that result in a near-100% load
>>>> and ignore anything else. Near-100% load is a highly unlikely scenario
>>>> for a latency-sensitive workload.
>>>>
>>>> When combining the above two paragraphs, I believe other interesting
>>>> setups are missing from the experiments, such as comparing to two pairs
>>>> of xsk_rr under high load (as mentioned in my previous emails).
>>> This is to support an existing real workload. We cannot easily modify
>>> its threading model. The two xsk_rr model would be a different
>>> workload.
>>
>> That's fine, but:
>>
>> - In principle I don't think it's a good justification for a kernel
>> change that an application cannot be rewritten.
>>
>> - I believe it is your responsibility to more comprehensively document
>> the impact of your proposed changes beyond your one particular workload.
>>
>> Also, I do believe there's a bug as mentioned before. I can't quite pin
>> it down, but every time after running a "NAPI threaded" experiment, my
>> server enters a funny state and eventually becomes largely unresponsive
>> without much useful output and needs a hard reset. For example:
>>
>> 1) Run "NAPI threaded" experiment
>> 2) Disable "threaded" parameter in NAPI config
>> 3) Run IRQ experiment -> xsk_rr hangs and apparently holds a lock,
>> because other services successively stop working.
> I just tried with this scenario and it seems to work fine.

Ok. I've reproduced it more concisely. This is after a fresh reboot:

sudo ethtool -L ens15f1np1 combined 1

sudo net-next/tools/net/ynl/pyynl/cli.py --no-schema --output-json\
  --spec net-next/Documentation/netlink/specs/netdev.yaml --do napi-set\
  --json='{"id": 8209, "threaded": "busy-poll-enabled"}'

# ping from another machine to this NIC works
# napi thread busy at 100%

sudo net-next/tools/net/ynl/pyynl/cli.py --no-schema --output-json\
  --spec net-next/Documentation/netlink/specs/netdev.yaml --do napi-set\
  --json='{"id": 8209, "threaded": "disabled"}'

# napi thread gone
# ping from another machine does not work
# tcpdump does not show incoming icmp packets
# but machine still responsive on other NIC

sudo ethtool -L ens15f1np1 combined 12

# networking hangs on all NICs
# sudo reboot on console hangs
# hard reset needed, no useful output
>> Do you not have this problem?
> Not really. Jakub actually fixed a deadlock in napi threaded recently.
> Maybe you are hitting that? Are you using the latest base-commit that
> I have in this patch series?

Yep:
- Ubuntu 24.04.3 LTS system
- base commit before patches is c3199adbe4ffffc7b6536715e0290d1919a45cd9
- NIC driver is ice, PCI id 8086:159b.

Let me know if you need any other information.

Best,
Martin
