Message-ID: <CAAywjhSGp6CaHXsO5EDANPHA=wpOO2C=4819+75fLoSuFL2dHA@mail.gmail.com>
Date: Tue, 25 Mar 2025 09:40:10 -0700
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Martin Karsten <mkarsten@...terloo.ca>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, 
	willemb@...gle.com, jdamato@...tly.com, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll

On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>
> On 2025-03-20 22:15, Samiullah Khawaja wrote:
> > Extend the existing threaded napi poll support to do continuous
> > busy polling.
> >
> > This is used for doing continuous polling of napi to fetch descriptors
> > from backing RX/TX queues for low latency applications. Allow enabling
> > of threaded busypoll using netlink so this can be enabled on a set of
> > dedicated napis for low latency applications.
> >
> > Once enabled, the user can fetch the PID of the kthread doing NAPI
> > polling and set its affinity, priority and scheduler depending on the
> > low-latency requirements.
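
For illustration, tuning such a kthread from userspace comes down to a
couple of standard scheduler calls. This is only a sketch under
assumptions: the PID is assumed to have been obtained elsewhere (e.g. by
looking up the napi/<dev>-<id> kthread), and the CPU number, policy and
priority below are placeholder example values, not recommendations.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    /* Pin the NAPI poller kthread to one CPU and give it RT priority.
     * 'pid' is the kthread PID fetched beforehand; CPU 3 and priority 50
     * are arbitrary example values. */
    static int tune_napi_kthread(pid_t pid)
    {
            cpu_set_t set;
            struct sched_param sp = { .sched_priority = 50 };

            CPU_ZERO(&set);
            CPU_SET(3, &set);               /* dedicate CPU 3 to the poller */
            if (sched_setaffinity(pid, sizeof(set), &set))
                    return -1;
            /* SCHED_FIFO so the poller is not preempted by normal tasks */
            return sched_setscheduler(pid, SCHED_FIFO, &sp);
    }
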
> >
> > Currently threaded napi can only be enabled at the device level using
> > sysfs. Add support to enable/disable threaded mode for an individual
> > napi via the netlink interface, by extending the `napi-set` op in the
> > netlink spec to allow setting the `threaded` attribute of a napi.
> >
> > Extend the threaded attribute in napi struct to add an option to enable
> > continuous busy polling. Extend the netlink and sysfs interface to allow
> > enabling/disabling threaded busypolling at device or individual napi
> > level.
> >
> > We use this for our AF_XDP based hard low-latency use case with
> > microsecond-level latency requirements. For our use case we want low
> > jitter and stable latency at P99.
> >
> > Following is an analysis and comparison of the available (and compatible)
> > busy poll interfaces for a low latency use case with stable P99. Please
> > note that throughput and cpu efficiency are non-goals.
> >
> > For the analysis we use an AF_XDP based benchmarking tool, `xdp_rr`. The
> > description of the tool and how it tries to simulate the real workload
> > is as follows:
> >
> > - It sends UDP packets between 2 machines.
> > - The client machine sends packets at a fixed frequency. To maintain the
> >    packet sending frequency, we use open-loop sampling; that is, the
> >    packets are sent from a separate thread.
> > - The server replies to the packet inline: it reads the packet from the
> >    recv ring and replies using the tx ring.
> > - To simulate the application processing time, we use a configurable
> >    delay in usecs on the client side after a reply is received from the
> >    server.
> >
> > The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.
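
The open-loop sending described above can be pictured as a sender thread
that paces transmissions against absolute deadlines, so a late reply never
delays the next packet. A minimal sketch, not taken from xdp_rr itself;
send_one_packet() is a placeholder for the real AF_XDP TX path:

    #include <time.h>

    extern void send_one_packet(void);  /* placeholder for the TX path */

    /* Transmit one packet every period_ns nanoseconds on an absolute
     * schedule, independent of when (or whether) replies arrive. */
    static void open_loop_sender(long period_ns, int count)
    {
            struct timespec next;

            clock_gettime(CLOCK_MONOTONIC, &next);
            for (int i = 0; i < count; i++) {
                    send_one_packet();
                    next.tv_nsec += period_ns;
                    while (next.tv_nsec >= 1000000000L) {
                            next.tv_nsec -= 1000000000L;
                            next.tv_sec++;
                    }
                    /* Sleep to the absolute deadline so jitter in the send
                     * path does not accumulate over the run. */
                    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                                    &next, NULL);
            }
    }
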
>
> Thanks very much for sending the benchmark program and these specific
> experiments. I am able to build the tool and run the experiments in
> principle. While I don't have a complete picture yet, one observation
> seems already clear, so I want to report back on it.
Thanks for reproducing this, Martin. I really appreciate you reviewing
this and your interest in it.
>
> > We use this tool with following napi polling configurations,
> >
> > - Interrupts only
> > - SO_BUSYPOLL (inline in the same thread where the client receives the
> >    packet).
> > - SO_BUSYPOLL (separate thread and separate core)
> > - Threaded NAPI busypoll
>
> The configurations that you describe as SO_BUSYPOLL here are not using
> the best busy-polling configuration. The best busy-polling strictly
> alternates between application processing and network polling. No
> asynchronous processing due to hardware irq delivery or softirq
> processing should happen.
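
In code, the strictly alternating pattern looks roughly like the loop
below. This is a sketch under assumptions, not the benchmark itself:
xsk_fd is an AF_XDP socket already configured for busy polling, and
handle_descriptors() stands in for the application's RX/TX ring work.

    #include <sys/socket.h>

    extern void handle_descriptors(void);   /* placeholder for app work */

    static void busy_poll_loop(int xsk_fd)
    {
            for (;;) {
                    /* Non-blocking receive: with busy polling enabled on
                     * the socket this drives the NAPI context from this
                     * thread instead of relying on irq/softirq work. */
                    recvfrom(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, NULL);
                    /* Application processing strictly alternates with the
                     * network polling above; the thread never blocks. */
                    handle_descriptors();
            }
    }
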
>
> A high-level check is making sure that no softirq processing is reported
> for the relevant cores (see, e.g., "%soft" in sar -P <cores> -u ALL 1).
> In addition, interrupts can be counted in /proc/stat or /proc/interrupts.
>
> Unfortunately it is not always straightforward to enter this pattern. In
> this particular case, it seems that two pieces are missing:
>
> 1) Because the XDP socket is created with XDP_COPY, it is never marked
> with its corresponding napi_id. Without the socket being marked with a
> valid napi_id, sk_busy_loop (called from __xsk_recvmsg) never invokes
> napi_busy_loop. Instead the gro_flush_timeout/napi_defer_hard_irqs
> softirq loop controls packet delivery.
Nice catch. It seems a recent change broke busy polling for AF_XDP;
there was a fix for XDP_ZEROCOPY, but XDP_COPY remained broken, and it
seems I didn't pick that up in my experiments. During my experimentation
I confirmed, through perf traces, that all experiment modes invoke busy
poll and do not go through softirqs. I sent out a fix for XDP_COPY busy
polling at the link below. I will resend it for net, since the original
commit has already landed in 6.13.
https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t
>
> I found code at the end of xsk_bind in xsk.c that is conditional on xs->zc:
>
>         if (xs->zc && qid < dev->real_num_rx_queues) {
>                 struct netdev_rx_queue *rxq;
>
>                 rxq = __netif_get_rx_queue(dev, qid);
>                 if (rxq->napi)
>                         __sk_mark_napi_id_once(sk, rxq->napi->napi_id);
>         }
>
> I am not an expert on XDP sockets, so I don't know why that is or what
> would be an acceptable workaround/fix, but when I simply remove the
> check for xs->zc, the socket is being marked and napi_busy_loop is being
> called. But maybe there's a better way to accomplish this.
+1
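
For illustration only, and not necessarily the shape of the eventual
upstream fix, relaxing the zero-copy condition quoted above would leave
the marking logic looking roughly like this, so XDP_COPY sockets also
get a napi_id:

        if (qid < dev->real_num_rx_queues) {
                struct netdev_rx_queue *rxq;

                rxq = __netif_get_rx_queue(dev, qid);
                if (rxq->napi)
                        /* mark copy-mode sockets too, so sk_busy_loop()
                         * can find the napi to poll */
                        __sk_mark_napi_id_once(sk, rxq->napi->napi_id);
        }
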
>
> 2) SO_PREFER_BUSY_POLL needs to be set on the XDP socket to make sure
> that busy polling stays in control after napi_busy_loop, regardless of
> how many packets were found. Without this setting, the gro_flush_timeout
> timer is not extended in busy_poll_stop.
>
> With these two changes, both SO_BUSYPOLL alternatives perform noticeably
> better in my experiments and come closer to Threaded NAPI busypoll, so I
> was wondering if you could try that in your environment? While this
> might not change the big picture, I think it's important to fully
> understand and document the trade-offs.
I agree. In my experiments SO_BUSYPOLL works properly, please see the
fix I mentioned above. But I will experiment with SO_PREFER_BUSY_POLL
to see whether it makes any significant difference.
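
For reference, these options are plain setsockopt() calls on the AF_XDP
socket. A minimal sketch with arbitrary example values; the constants
come from include/uapi/asm-generic/socket.h on kernels with preferred
busy polling (5.11+), and the fallback defines are only for older
userspace headers:

    #include <sys/socket.h>

    #ifndef SO_PREFER_BUSY_POLL
    #define SO_PREFER_BUSY_POLL 69
    #endif
    #ifndef SO_BUSY_POLL_BUDGET
    #define SO_BUSY_POLL_BUDGET 70
    #endif

    /* Enable preferred busy polling on an already-created AF_XDP socket.
     * 50 usecs of polling and a budget of 64 are example values only. */
    static int enable_busy_poll(int fd)
    {
            int prefer = 1, usecs = 50, budget = 64;

            if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                           &prefer, sizeof(prefer)))
                    return -1;
            if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                           &usecs, sizeof(usecs)))
                    return -1;
            return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
                              &budget, sizeof(budget));
    }
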
>
> Thanks,
> Martin
>
