Open Source and information security mailing list archives
 
Message-ID: <5cf6b8cd-bba6-4a68-a0b2-58c584d90886@uwaterloo.ca>
Date: Wed, 26 Mar 2025 17:22:07 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, willemb@...gle.com,
 jdamato@...tly.com, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll

On 2025-03-26 16:34, Samiullah Khawaja wrote:
> On Tue, Mar 25, 2025 at 10:47 AM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>
>> On 2025-03-25 12:40, Samiullah Khawaja wrote:
>>> On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>>>
>>>> On 2025-03-20 22:15, Samiullah Khawaja wrote:
>>>>> Extend the existing support for threaded napi poll to do continuous
>>>>> busy polling.
>>>>>
>>>>> This is used for continuous polling of a napi to fetch descriptors
>>>>> from the backing RX/TX queues for low-latency applications. Allow
>>>>> enabling threaded busypoll using netlink so it can be enabled on a
>>>>> set of dedicated napis.
>>>>>
>>>>> Once enabled, the user can fetch the PID of the kthread doing NAPI
>>>>> polling and set its affinity, priority and scheduler depending on the
>>>>> low-latency requirements.
>>>>>
>>>>> Currently threaded napi can only be enabled at the device level using
>>>>> sysfs. Add support to enable/disable threaded mode for a napi
>>>>> individually, using the netlink interface. Extend the `napi-set` op in
>>>>> the netlink spec to allow setting the `threaded` attribute of a napi.
>>>>>
>>>>> Extend the threaded attribute in the napi struct with an option to
>>>>> enable continuous busy polling. Extend the netlink and sysfs interfaces
>>>>> to allow enabling/disabling threaded busypolling at the device or
>>>>> individual napi level.
>>>>>
>>>>> We use this for our AF_XDP based hard low-latency usecase with
>>>>> microsecond-level latency requirements. For our usecase we want low
>>>>> jitter and stable latency at P99.
>>>>>
>>>>> Following is an analysis and comparison of the available (and compatible)
>>>>> busy poll interfaces for a low-latency usecase with stable P99. Please
>>>>> note that throughput and cpu efficiency are non-goals.
>>>>>
>>>>> For the analysis we use an AF_XDP based benchmarking tool, `xdp_rr`. A
>>>>> description of the tool and how it tries to simulate the real workload
>>>>> follows:
>>>>>
>>>>> - It sends UDP packets between 2 machines.
>>>>> - The client machine sends packets at a fixed frequency. To maintain the
>>>>>      frequency of the packets being sent, we use open-loop sampling; that
>>>>>      is, the packets are sent from a separate thread.
>>>>> - The server replies to each packet inline by reading it from the
>>>>>      RX ring and replying using the TX ring.
>>>>> - To simulate the application processing time, we use a configurable
>>>>>      delay in usecs on the client side after a reply is received from the
>>>>>      server.
>>>>>
>>>>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.
>>>>
>>>> Thanks very much for sending the benchmark program and these specific
>>>> experiments. I am able to build the tool and run the experiments in
>>>> principle. While I don't have a complete picture yet, one observation
>>>> seems already clear, so I want to report back on it.
>>> Thanks for reproducing this Martin. Really appreciate you reviewing
>>> this and your interest in this.
>>>>
>>>>> We use this tool with the following napi polling configurations:
>>>>>
>>>>> - Interrupts only
>>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>>>      packet).
>>>>> - SO_BUSYPOLL (separate thread and separate core)
>>>>> - Threaded NAPI busypoll
>>>>
>>>> The configurations that you describe as SO_BUSYPOLL here are not using
>>>> the best busy-polling configuration. The best busy-polling strictly
>>>> alternates between application processing and network polling. No
>>>> asynchronous processing due to hardware irq delivery or softirq
>>>> processing should happen.
>>>>
>>>> A high-level check is making sure that no softirq processing is reported
>>>> for the relevant cores (see, e.g., "%soft" in sar -P <cores> -u ALL 1).
>>>> In addition, interrupts can be counted in /proc/stat or /proc/interrupts.
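As a concrete sketch of that check (the one-second interval is arbitrary, and the cores of interest are whichever ones run the application):

```shell
# Sample softirq activity twice, one second apart; in a clean busy-poll
# configuration the NET_RX counters on the polling cores should not
# advance between the two samples.
grep -E 'CPU|NET_RX' /proc/softirqs
sleep 1
grep -E 'CPU|NET_RX' /proc/softirqs

# Per-IRQ counters; the NIC queue interrupts should likewise stay flat.
head -n 20 /proc/interrupts
```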
>>>>
>>>> Unfortunately it is not always straightforward to enter this pattern. In
>>>> this particular case, it seems that two pieces are missing:
>>>>
>>>> 1) Because the XDP socket is created with XDP_COPY, it is never marked
>>>> with its corresponding napi_id. Without the socket being marked with a
>>>> valid napi_id, sk_busy_loop (called from __xsk_recvmsg) never invokes
>>>> napi_busy_loop. Instead the gro_flush_timeout/napi_defer_hard_irqs
>>>> softirq loop controls packet delivery.
>>> Nice catch. It seems a recent change broke busy polling for AF_XDP;
>>> there was a fix for XDP_ZEROCOPY, but XDP_COPY remained broken, and it
>>> seems I didn't pick that up in my experiments. During my
>>> experimentation I confirmed that all experiment modes invoke
>>> busypoll and do not go through softirqs. I confirmed this through
>>> perf traces. I sent out a fix for XDP_COPY busy polling in the link
>>> below. I will resend it for net since the original commit has
>>> already landed in 6.13.
>>> https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t
>>>>
>>>> I found code at the end of xsk_bind in xsk.c that is conditional on xs->zc:
>>>>
>>>>           if (xs->zc && qid < dev->real_num_rx_queues) {
>>>>                   struct netdev_rx_queue *rxq;
>>>>
>>>>                   rxq = __netif_get_rx_queue(dev, qid);
>>>>                   if (rxq->napi)
>>>>                           __sk_mark_napi_id_once(sk, rxq->napi->napi_id);
>>>>           }
>>>>
>>>> I am not an expert on XDP sockets, so I don't know why that is or what
>>>> would be an acceptable workaround/fix, but when I simply remove the
>>>> check for xs->zc, the socket is being marked and napi_busy_loop is being
>>>> called. But maybe there's a better way to accomplish this.
>>> +1
>>>>
>>>> 2) SO_PREFER_BUSY_POLL needs to be set on the XDP socket to make sure
>>>> that busy polling stays in control after napi_busy_loop, regardless of
>>>> how many packets were found. Without this setting, the gro_flush_timeout
>>>> timer is not extended in busy_poll_stop.
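A minimal sketch of setting this option on the socket; the helper name and the fallback define are illustrative, and note that setting it typically requires CAP_NET_ADMIN:

```c
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69	/* from include/uapi/asm-generic/socket.h */
#endif

/*
 * Keep busy polling in control of the socket's napi after
 * napi_busy_loop returns, so busy_poll_stop extends the
 * gro_flush_timeout timer instead of handing the napi back to
 * interrupts. Returns 0 on success; fails with EPERM without
 * CAP_NET_ADMIN.
 */
int enable_prefer_busy_poll(int fd)
{
	int on = 1;

	return setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
			  &on, sizeof(on));
}
```

In the benchmark this would be applied to the XDP socket right after it is created, before entering the receive loop.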
>>>>
>>>> With these two changes, both SO_BUSYPOLL alternatives perform noticeably
>>>> better in my experiments and come closer to Threaded NAPI busypoll, so I
>>>> was wondering if you could try that in your environment? While this
>>>> might not change the big picture, I think it's important to fully
>>>> understand and document the trade-offs.
>>> I agree. In my experiments SO_BUSYPOLL works properly; please see
>>> the commit I mentioned above. But I will experiment with
>>> SO_PREFER_BUSY_POLL to see whether it makes any significant difference.
>>
>> I'd like to clarify: Your original experiments cannot have used
>> busypoll, because it was broken for XDP_COPY. Did you rerun the
> On my idpf test platform the AF_XDP support is broken with the latest
> kernel, so I didn't have the original commit that broke AF_XDP
> busypoll for zerocopy and copy mode. So in the experiments that I
> shared XDP_COPY busy poll has been working. Please see the traces
> below. Sorry for the confusion.

Ok, that explains it.

>> experiments with the XDP_COPY fix but without SO_PREFER_BUSY_POLL and
> I tried with SO_PREFER_BUSY_POLL as you suggested, I see results
> matching the previous observation:
> 
> 12Kpkts/sec with 78usecs delay:
> 
> INLINE:
> p5: 16700
> p50: 17100
> p95: 17200
> p99: 17200

This comment applies to the experiments overall: I believe these
carefully crafted period/delay configurations that just straddle the
capacity limit do not show any additional benefit beyond what the basic
experiments (without application delay) already show.

If you want to illustrate the fact that the slightly faster mechanism 
reaches capacity a little later, I would find experiments with a fixed 
period and varying the delay from 0 to overload more illustrative.

>> see the same latency numbers as before? Also, can you provide more
>> details about the perf tracing that you used to see that busypoll is
>> invoked, but softirq is not?
> I used the following command to record the call graph and could see
> the calls to napi_busy_loop going from xsk_rcvmsg. Confirmed with
> SO_PREFER_BUSY_POLL also below,
> ```
> perf record -o prefer.perf -a -e cycles -g sleep 10
> perf report --stdio -i prefer.perf
> ```
> 
> ```
>   --1.35%--entry_SYSCALL_64
>              |
>               --1.31%--do_syscall_64
>                         __x64_sys_recvfrom
>                         __sys_recvfrom
>                         sock_recvmsg
>                         xsk_recvmsg
>                         __xsk_recvmsg.constprop.0.isra.0
>                         napi_busy_loop
>                         __napi_busy_loop
> ```
> 
> I do see softirq getting triggered occasionally, when the inline
> busy_poll thread is not able to pick up the packets. I used the following
> command to find the number of samples for each symbol in the trace,
> 
> ```
> perf report -g -n -i prefer.perf
> ```
> 
> Filtered the results to include only the interesting symbols
> ```
> <
> Children      Self       Samples  Command          Shared Object
>            Symbol
> +    1.48%     0.06%            46  xsk_rr           [idpf]
>              [k] idpf_vport_splitq_napi_poll
> 
> +    1.28%     0.11%            86  xsk_rr           [kernel.kallsyms]
>              [k] __napi_busy_loop
> 
> +    0.71%     0.02%            17  xsk_rr           [kernel.kallsyms]
>              [k] net_rx_action
> 
> +    0.69%     0.01%             6  xsk_rr           [kernel.kallsyms]
>              [k] __napi_poll
> ```

Thanks, this makes me realize that I forgot to mention something as well:

SO_PREFER_BUSY_POLL should eliminate the remaining softirq invocations, 
but only if gro_flush_timeout is big enough. In fact, in a full busypoll 
configuration, the value of gro_flush_timeout should not matter at all, 
as long as it's sufficiently higher than the application period. I have 
set it to 1000000 for these experiments as another litmus test that 
busypoll is actually working.
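For completeness, these are the knobs in question; the interface name is a placeholder, 1000000 is the value used above, and the napi_defer_hard_irqs value is only a typical choice:

```shell
# Set a generous softirq-fallback timer (nanoseconds, i.e. 1 ms here),
# well above the application period, so busypoll stays in control.
# "eth0" is a placeholder for the actual interface.
echo 1000000 > /sys/class/net/eth0/gro_flush_timeout

# Defer hardware-irq re-arming for a couple of empty polls.
echo 2 > /sys/class/net/eth0/napi_defer_hard_irqs
```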

Last but not least, I found that co-locating the single-threaded 
busy-polling application with the irq core improved the outcome. I.e., 
in your experiment setup you would taskset the application to Core 2. 
I'm not sure I have a rock-solid explanation, but it did make a difference.
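Concretely, with the irq on Core 2 as in your setup, the pinning would look roughly like this; the IRQ number is a placeholder and the application arguments are omitted:

```shell
# Steer the NIC queue irq to Core 2 (123 is a placeholder irq number).
echo 2 > /proc/irq/123/smp_affinity_list

# Run the single-threaded busy-polling application on the same core.
taskset -c 2 ./xsk_rr
```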

Since net-next patch submission is closed, I thought I'd provide this 
feedback now, so you can decide whether to take it into account for the 
next go-around.

Best,
Martin

