Message-ID: <CAAywjhRuJYakS4=zqtB7QzthJE+1UQfcaqT2bcj6sWPN_6Akeg@mail.gmail.com>
Date: Wed, 26 Mar 2025 13:34:30 -0700
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Martin Karsten <mkarsten@...terloo.ca>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com,
willemb@...gle.com, jdamato@...tly.com, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll
On Tue, Mar 25, 2025 at 10:47 AM Martin Karsten <mkarsten@...terloo.ca> wrote:
>
> On 2025-03-25 12:40, Samiullah Khawaja wrote:
> > On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
> >>
> >> On 2025-03-20 22:15, Samiullah Khawaja wrote:
> >>> Extend the existing threaded napi poll support to do continuous busy
> >>> polling.
> >>>
> >>> This is used to continuously poll a napi to fetch descriptors from the
> >>> backing RX/TX queues for low-latency applications. Allow enabling threaded
> >>> busypoll using netlink so it can be turned on for a set of dedicated napis.
> >>>
> >>> Once enabled, the user can fetch the PID of the kthread doing the NAPI
> >>> polling and set its affinity, priority and scheduler according to the
> >>> low-latency requirements.
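[For illustration: a minimal sketch, in userspace C, of pinning such a napi
kthread to a dedicated core with a real-time policy. The PID can be taken
from the napi info exposed by the kernel (the kthread is named
"napi/<dev>-<id>"); the core number and priority below are examples only,
not values from this series.]
```
/* Sketch: pin a napi polling kthread to a core and give it an RT policy.
 * napi_pid is the kthread PID; cpu and the priority 50 are examples. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

int pin_napi_kthread(pid_t napi_pid, int cpu)
{
	cpu_set_t set;
	struct sched_param sp = { .sched_priority = 50 };

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(napi_pid, sizeof(set), &set) < 0)
		return -1;
	return sched_setscheduler(napi_pid, SCHED_FIFO, &sp);
}
```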
> >>>
> >>> Currently threaded napi can only be enabled at the device level using
> >>> sysfs. Add support to enable/disable threaded mode for an individual napi
> >>> through the netlink interface, by extending the `napi-set` op in the
> >>> netlink spec to allow setting the `threaded` attribute of a napi.
> >>>
> >>> Extend the threaded attribute in the napi struct with an option to enable
> >>> continuous busy polling. Extend the netlink and sysfs interfaces to allow
> >>> enabling/disabling threaded busypolling at the device or individual napi
> >>> level.
> >>>
> >>> We use this for our AF_XDP based hard low-latency use case with
> >>> microsecond-level latency requirements. For our use case we want low
> >>> jitter and stable latency at P99.
> >>>
> >>> Following is an analysis and comparison of the available (and compatible)
> >>> busy poll interfaces for a low-latency use case with stable P99. Please
> >>> note that throughput and CPU efficiency are non-goals.
> >>>
> >>> For the analysis we use an AF_XDP based benchmarking tool, `xdp_rr`. The
> >>> tool and how it tries to simulate the real workload are described below:
> >>>
> >>> - It sends UDP packets between 2 machines.
> >>> - The client machine sends packets at a fixed frequency. To maintain the
> >>> frequency of the packets being sent, we use open-loop sampling; that is,
> >>> the packets are sent in a separate thread (see the pacing sketch below).
> >>> - The server replies to the packet inline, reading the packet from the
> >>> RX ring and replying on the TX ring.
> >>> - To simulate the application processing time, we use a configurable
> >>> delay in usecs on the client side after a reply is received from the
> >>> server.
> >>>
> >>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.
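[To make "open-loop sampling" concrete: the sender paces packets off absolute
deadlines derived from the configured rate, independent of when (or whether)
replies arrive. A minimal sketch follows; send_one_packet() is a placeholder
for the actual AF_XDP TX-ring submission, not the xdp_rr code itself.]
```
/* Open-loop pacing sketch: send on absolute deadlines so the send rate
 * does not depend on reply latency.  send_one_packet() is a stand-in
 * for the real AF_XDP TX path. */
#include <stdint.h>
#include <time.h>

static void send_one_packet(void)
{
	/* placeholder for the actual TX-ring submission */
}

void open_loop_sender(uint64_t pkts_per_sec)
{
	struct timespec next;
	uint64_t interval_ns = 1000000000ULL / pkts_per_sec;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (;;) {
		send_one_packet();

		/* Advance the deadline by a fixed step; the schedule does
		 * not drift even if a reply is late or missing. */
		next.tv_nsec += (long)interval_ns;
		while (next.tv_nsec >= 1000000000L) {
			next.tv_nsec -= 1000000000L;
			next.tv_sec++;
		}
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}
}
```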
> >>
> >> Thanks very much for sending the benchmark program and these specific
> >> experiments. I am able to build the tool and run the experiments in
> >> principle. While I don't have a complete picture yet, one observation
> >> seems already clear, so I want to report back on it.
> > Thanks for reproducing this Martin. Really appreciate you reviewing
> > this and your interest in this.
> >>
> >>> We use this tool with the following napi polling configurations:
> >>>
> >>> - Interrupts only
> >>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> >>> packet).
> >>> - SO_BUSYPOLL (separate thread and separate core)
> >>> - Threaded NAPI busypoll
> >>
> >> The configurations that you describe as SO_BUSYPOLL here are not using
> >> the best busy-polling configuration. The best busy-polling strictly
> >> alternates between application processing and network polling. No
> >> asynchronous processing due to hardware irq delivery or softirq
> >> processing should happen.
> >>
> >> A high-level check is making sure that no softirq processing is reported
> >> for the relevant cores (see, e.g., "%soft" in sar -P <cores> -u ALL 1).
> >> In addition, interrupts can be counted in /proc/stat or /proc/interrupts.
> >>
> >> Unfortunately it is not always straightforward to enter this pattern. In
> >> this particular case, it seems that two pieces are missing:
> >>
> >> 1) Because the XDP socket is created with XDP_COPY, it is never marked
> >> with its corresponding napi_id. Without the socket being marked with a
> >> valid napi_id, sk_busy_loop (called from __xsk_recvmsg) never invokes
> >> napi_busy_loop. Instead the gro_flush_timeout/napi_defer_hard_irqs
> >> softirq loop controls packet delivery.
> > Nice catch. It seems a recent change broke busy polling for AF_XDP;
> > there was a fix for XDP_ZEROCOPY, but XDP_COPY remained broken, and it
> > seems my experiments didn't pick that up. During my experimentation I
> > confirmed, through perf traces, that all experiment modes invoke
> > busypoll and do not go through softirqs. I sent out a fix for XDP_COPY
> > busy polling in the link below. I will resend it for net since the
> > original commit has already landed in 6.13.
> > https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t
> >>
> >> I found code at the end of xsk_bind in xsk.c that is conditional on xs->zc:
> >>
> >> if (xs->zc && qid < dev->real_num_rx_queues) {
> >> struct netdev_rx_queue *rxq;
> >>
> >> rxq = __netif_get_rx_queue(dev, qid);
> >> if (rxq->napi)
> >> __sk_mark_napi_id_once(sk, rxq->napi->napi_id);
> >> }
> >>
> >> I am not an expert on XDP sockets, so I don't know why that is or what
> >> would be an acceptable workaround/fix, but when I simply remove the
> >> check for xs->zc, the socket is being marked and napi_busy_loop is being
> >> called. But maybe there's a better way to accomplish this.
> > +1
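For reference, this is roughly what dropping the xs->zc condition in
xsk_bind() would look like (untested sketch, shown only to illustrate the
suggestion; the actual fix linked above may differ):
```
/* Sketch of the xsk_bind() hunk with the xs->zc check dropped, so
 * copy-mode sockets also get marked with the queue's napi_id. */
if (qid < dev->real_num_rx_queues) {
	struct netdev_rx_queue *rxq;

	rxq = __netif_get_rx_queue(dev, qid);
	if (rxq->napi)
		__sk_mark_napi_id_once(sk, rxq->napi->napi_id);
}
```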
> >>
> >> 2) SO_PREFER_BUSY_POLL needs to be set on the XDP socket to make sure
> >> that busy polling stays in control after napi_busy_loop, regardless of
> >> how many packets were found. Without this setting, the gro_flush_timeout
> >> timer is not extended in busy_poll_stop.
> >>
> >> With these two changes, both SO_BUSYPOLL alternatives perform noticeably
> >> better in my experiments and come closer to Threaded NAPI busypoll, so I
> >> was wondering if you could try that in your environment? While this
> >> might not change the big picture, I think it's important to fully
> >> understand and document the trade-offs.
> > I agree. In my experiments SO_BUSYPOLL works properly; please see the
> > commit I mentioned above. But I will experiment with SO_PREFER_BUSY_POLL
> > to see whether it makes any significant change.
>
> I'd like to clarify: Your original experiments cannot have used
> busypoll, because it was broken for XDP_COPY. Did you rerun the
On my idpf test platform AF_XDP support is broken with the latest
kernel, so the kernel I tested did not include the original commit that
broke AF_XDP busypoll for zerocopy and copy mode. So in the experiments
I shared, XDP_COPY busy poll has been working. Please see the traces
below. Sorry for the confusion.
> experiments with the XDP_COPY fix but without SO_PREFER_BUSY_POLL and
I tried with SO_PREFER_BUSY_POLL as you suggested and I see results
matching the previous observation:
12Kpkts/sec with 78usecs delay:
INLINE:
p5: 16700
p50: 17100
p95: 17200
p99: 17200
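(For anyone reproducing this: SO_PREFER_BUSY_POLL was enabled on the XDP
socket roughly as in the sketch below; the timeout and budget values there
are examples, not the exact values used in these runs.)
```
/* Sketch: enabling preferred busy polling on the AF_XDP socket fd.
 * The timeout and budget values are examples only. */
#include <sys/socket.h>

/* Fallbacks for older libc headers; values from
 * include/uapi/asm-generic/socket.h. */
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70
#endif

int enable_prefer_busy_poll(int xsk_fd)
{
	int prefer = 1;
	int timeout_us = 20;   /* example SO_BUSY_POLL timeout in usecs */
	int budget = 64;       /* example batch budget */

	if (setsockopt(xsk_fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
		       &prefer, sizeof(prefer)) < 0)
		return -1;
	if (setsockopt(xsk_fd, SOL_SOCKET, SO_BUSY_POLL,
		       &timeout_us, sizeof(timeout_us)) < 0)
		return -1;
	return setsockopt(xsk_fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
			  &budget, sizeof(budget));
}
```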
> see the same latency numbers as before? Also, can you provide more
> details about the perf tracing that you used to see that busypoll is
> invoked, but softirq is not?
I used the following command to record the call graph and could see
the calls to napi_busy_loop coming from xsk_recvmsg. This is confirmed
with SO_PREFER_BUSY_POLL as well, shown below:
```
perf record -o prefer.perf -a -e cycles -g sleep 10
perf report --stdio -i prefer.perf
```
```
--1.35%--entry_SYSCALL_64
|
--1.31%--do_syscall_64
__x64_sys_recvfrom
__sys_recvfrom
sock_recvmsg
xsk_recvmsg
__xsk_recvmsg.constprop.0.isra.0
napi_busy_loop
__napi_busy_loop
```
I do see softirq getting triggered occasionally, when the inline
busy-poll thread is not able to pick up the packets. I used the
following command to find the number of samples for each in the trace:
```
perf report -g -n -i prefer.perf
```
I filtered the results to include only the interesting symbols:
```
Children  Self   Samples  Command  Shared Object      Symbol
+ 1.48%   0.06%  46       xsk_rr   [idpf]             [k] idpf_vport_splitq_napi_poll
+ 1.28%   0.11%  86       xsk_rr   [kernel.kallsyms]  [k] __napi_busy_loop
+ 0.71%   0.02%  17       xsk_rr   [kernel.kallsyms]  [k] net_rx_action
+ 0.69%   0.01%  6        xsk_rr   [kernel.kallsyms]  [k] __napi_poll
```
>
> Thanks,
> Martin
>