Message-ID: <63ff1034-4fd0-46ee-ae6e-1ca2efc18b1c@uwaterloo.ca>
Date: Fri, 29 Aug 2025 18:19:14 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, willemb@...gle.com,
Joe Damato <joe@...a.to>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
On 2025-08-29 14:08, Martin Karsten wrote:
> On 2025-08-29 13:50, Samiullah Khawaja wrote:
>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@...terloo.ca>
>> wrote:
>>>
>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
>>>> Extend the already existing support for threaded napi poll to do
>>>> continuous busy polling.
>>>>
>>>> This is used to continuously poll a napi and fetch descriptors from the
>>>> backing RX/TX queues for low latency applications. Threaded busypoll is
>>>> enabled using netlink, so it can be turned on for a set of dedicated
>>>> napis for low latency applications.
>>>>
>>>> Once enabled, the user can fetch the PID of the kthread doing the NAPI
>>>> polling and set its affinity, priority and scheduler depending on the
>>>> low-latency requirements.
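>>>>
>>>> As an illustration of that step, a minimal sketch (assuming the NAPI
>>>> kthread is named napi/<dev>-<id>, as matched by the setup script further
>>>> below; the device name, CPU and priority are just placeholders):
>>>>
>>>> ```
>>>> # find the NAPI polling kthread for eth0, then prioritize and pin it
>>>> NAPI_T=$(pgrep 'napi/eth0' | head -n1)
>>>> sudo chrt -f -p 50 "$NAPI_T"   # SCHED_FIFO, priority 50
>>>> sudo taskset -pc 2 "$NAPI_T"   # pin the poller to CPU 2
>>>> ```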
>>>>
>>>> Extend the netlink interface to allow enabling/disabling threaded
>>>> busypolling at the individual napi level.
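>>>>
>>>> For illustration, the per-napi control could be driven roughly like this
>>>> via the netdev YNL CLI (the CLI path, the exact attribute name and the
>>>> accepted values are assumptions here and are defined by these patches;
>>>> the ifindex and napi id below are made up):
>>>>
>>>> ```
>>>> # list the NAPI instances of the device
>>>> ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>>>>     --dump napi-get --json '{"ifindex": 2}'
>>>>
>>>> # enable threaded busy polling on one NAPI instance
>>>> ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>>>>     --do napi-set --json '{"id": 8193, "threaded": "busy-poll"}'
>>>> ```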
>>>>
>>>> We use this for our AF_XDP based hard low-latency usecase with a
>>>> usecs-level latency requirement. For our usecase we want low jitter and
>>>> stable latency at P99.
>>>>
>>>> The following is an analysis and comparison of the available (and
>>>> compatible) busy poll interfaces for a low latency usecase with stable
>>>> P99. This can be suitable for applications that want very low latency at
>>>> the expense of cpu usage and efficiency.
>>>>
>>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
>>>> backing a socket, but the missing piece is a mechanism to busy poll a
>>>> NAPI instance in a dedicated thread while ignoring available events or
>>>> packets, regardless of the userspace API. Most existing mechanisms are
>>>> designed to work in a pattern where you poll until new packets or events
>>>> are received, after which userspace is expected to handle them.
>>>>
>>>> As a result, one has to hack together a solution using a mechanism
>>>> intended to receive packets or events, not to simply NAPI poll. NAPI
>>>> threaded busy polling, on the other hand, provides this capability
>>>> natively, independent of any userspace API. This makes it really easy to
>>>> set up and manage.
>>>>
>>>> For the analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
>>>> description of the tool and how it tries to simulate the real workload is
>>>> as follows:
>>>>
>>>> - It sends UDP packets between 2 machines.
>>>> - The client machine sends packets at a fixed frequency. To maintain the
>>>>   frequency of the packets being sent, we use open-loop sampling, that
>>>>   is, the packets are sent in a separate thread.
>>>> - The server replies to the packet inline by reading the packet from the
>>>>   recv ring and replying using the tx ring.
>>>> - To simulate the application processing time, we use a configurable
>>>>   delay in usecs on the client side after a reply is received from the
>>>>   server.
>>>>
>>>> The xsk_rr tool is posted separately as an RFC for
>>>> tools/testing/selftests.
>>>>
>>>> We use this tool with the following napi polling configurations:
>>>>
>>>> - Interrupts only
>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>> packet).
>>>> - SO_BUSYPOLL (separate thread and separate core)
>>>> - Threaded NAPI busypoll
>>>>
>>>> The system is configured using the following script in all 4 cases:
>>>>
>>>> ```
>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
>>>> echo off | sudo tee /sys/devices/system/cpu/smt/control
>>>>
>>>> sudo ethtool -L eth0 rx 1 tx 1
>>>> sudo ethtool -G eth0 rx 1024
>>>>
>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>>>>
>>>> # pin IRQs on CPU 2
>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
>>>> print arr[0]}' < /proc/interrupts)"
>>>> for irq in ${IRQS}; \
>>>> do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>>>
>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>>>>
>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
>>>> do echo $i; echo 1,2,3,4,5,6 > $i; done
>>>>
>>>> if [[ -z "$1" ]]; then
>>>> echo 400 | sudo tee /proc/sys/net/core/busy_read
>>>> echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>> fi
>>>>
>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
>>>>
>>>> if [[ "$1" == "enable_threaded" ]]; then
>>>> echo 0 | sudo tee /proc/sys/net/core/busy_poll
>>>> echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>> echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>> echo 2 | sudo tee /sys/class/net/eth0/threaded
>>>> NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>>>> sudo chrt -f -p 50 $NAPI_T
>>>>
>>>> # pin threaded poll thread to CPU 2
>>>> sudo taskset -pc 2 $NAPI_T
>>>> fi
>>>>
>>>> if [[ "$1" == "enable_interrupt" ]]; then
>>>> echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>> echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>> fi
>>>> ```
>>>
>>> The experiment script above does not work, because the sysfs setting it
>>> uses (writing 2 to the threaded parameter) does not exist anymore in this
>>> version.
>>>
>>>> To enable the various configurations, the script can be run as follows:
>>>>
>>>> - Interrupt Only
>>>> ```
>>>> <script> enable_interrupt
>>>> ```
>>>>
>>>> - SO_BUSYPOLL (no arguments to script)
>>>> ```
>>>> <script>
>>>> ```
>>>>
>>>> - NAPI threaded busypoll
>>>> ```
>>>> <script> enable_threaded
>>>> ```
>>>>
>>>> If using idpf, the script needs to be run again after launching the
>>>> workload, just to make sure that the configurations are not reverted, as
>>>> idpf reverts some configurations on a software reset when an AF_XDP
>>>> program is attached.
>>>>
>>>> Once configured, the workload is run with the various configurations
>>>> using the following commands. Set the period (1/frequency) and delay in
>>>> usecs to produce results for a given packet frequency and application
>>>> processing delay.
>>>>
>>>> ## Interrupt Only and SO_BUSYPOLL (inline)
>>>>
>>>> - Server
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>> -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>> -h -v
>>>> ```
>>>>
>>>> - Client
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>> -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>> -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v
>>>> ```
>>>>
>>>> ## SO_BUSYPOLL(done in separate core using recvfrom)
>>>>
>>>> Argument -t spawns a separate thread and continuously calls recvfrom.
>>>>
>>>> - Server
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>> -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>> -h -v -t
>>>> ```
>>>>
>>>> - Client
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>> -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>> -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t
>>>> ```
>>>>
>>>> ## NAPI Threaded Busy Poll
>>>>
>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
>>>>
>>>> - Server
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>> -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>> -h -v -n
>>>> ```
>>>>
>>>> - Client
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>> -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>> -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n
>>>> ```
>>>
>>> I believe there's a bug when disabling busy-polled napi threading after
>>> an experiment. My system hangs and needs a hard reset.
>>>
>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
>>>> |---|---|---|---|---|
>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>
>>> On my system, routing the irq to the same core where xsk_rr runs results
>>> in lower latency than routing the irq to a different core. To me that
>>> makes sense in a low-rate latency-sensitive scenario where interrupts are
>>> not causing much trouble, but the resulting locality might be beneficial.
>>> I think you should test this as well.
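>>>
>>> Concretely, the variant I have in mind is just the setup script above
>>> with the IRQ moved onto the xsk_rr core, along these lines (the CPU
>>> number is only an example):
>>>
>>> ```
>>> # route the NIC IRQ(s) to CPU 3 and run xsk_rr on the same CPU
>>> for irq in ${IRQS}; \
>>> do echo 3 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>> sudo chrt -f 50 taskset -c 3 ./xsk_rr <args as above>
>>> ```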
>>>
>>> The experiments reported above (except for the first one) are
>>> cherry-picking parameter combinations that result in a near-100% load
>>> and ignore anything else. Near-100% load is a highly unlikely scenario
>>> for a latency-sensitive workload.
>>>
>>> When combining the above two paragraphs, I believe other interesting
>>> setups are missing from the experiments, such as comparing to two pairs
>>> of xsk_rr under high load (as mentioned in my previous emails).
>> This is to support an existing real workload. We cannot easily modify
>> its threading model. The two xsk_rr model would be a different
>> workload.
>
> That's fine, but:
>
> - In principle I don't think it's a good justification for a kernel
> change that an application cannot be rewritten.
>
> - I believe it is your responsibility to more comprehensively document
> the impact of your proposed changes beyond your one particular workload.

A few more observations from my tests for the "SO_BUSYPOLL(separate)" case:
- Using -t for the client reduces latency compared to -T.
- Using poll instead of recvfrom in xsk_rr's rx_polling_run() also reduces
latency.

Best,
Martin