Message-ID: <20250912184608.6c6a7c51@kernel.org>
Date: Fri, 12 Sep 2025 18:46:08 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
almasrymina@...gle.com, willemb@...gle.com, Joe Damato <joe@...a.to>,
mkarsten@...terloo.ca, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v9 1/2] Extend napi threaded polling to allow
kthread based busy polling
On Thu, 11 Sep 2025 21:29:00 +0000 Samiullah Khawaja wrote:
> Add a new state to napi state enum:
>
> - NAPI_STATE_THREADED_BUSY_POLL
> Threaded busy poll is enabled/running for this napi.
>
> Following changes are introduced in the napi scheduling and state logic:
>
> - When threaded busy poll is enabled through netlink it also enables
> NAPI_STATE_THREADED so a kthread is created per napi. It also sets
> NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that it is
> going to busy poll the napi.
>
> - When napi is scheduled with NAPI_STATE_SCHED_THREADED and associated
> kthread is woken up, the kthread owns the context. If
> NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED both are
> set then it means that kthread can busy poll.
>
> - To keep busy polling and to avoid scheduling of the interrupts, the
> napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED
> and NAPI_STATE_THREADED_BUSY_POLL flags are set. Also
> napi_complete_done returns early to avoid the
> NAPI_STATE_SCHED_THREADED being unset.
>
> - If at any point NAPI_STATE_THREADED_BUSY_POLL is unset, the
> napi_complete_done will run and unset the NAPI_STATE_SCHED_THREADED
> bit also. This will make the associated kthread go to sleep as per
> existing logic.
>
> Signed-off-by: Samiullah Khawaja <skhawaja@...gle.com>
I think you need to spend some time trying to make this code more...
elegant.
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index c035dc0f64fd..ce28e8708a87 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -88,7 +88,7 @@ definitions:
> -
> name: napi-threaded
> type: enum
> - entries: [disabled, enabled]
> + entries: [disabled, enabled, busy-poll-enabled]
drop the -enabled
> attribute-sets:
> -
> @@ -291,7 +291,8 @@ attribute-sets:
> name: threaded
> doc: Whether the NAPI is configured to operate in threaded polling
> mode. If this is set to enabled then the NAPI context operates
> - in threaded polling mode.
> + in threaded polling mode. If this is set to busy-poll-enabled
> + then the NAPI kthread also does busypolling.
I don't think busypolling is a word? I mean, I don't think English
combines words like this.
> +Threaded NAPI busy polling
> +--------------------------
Please feed the documentation into a grammar checker. A bunch of
articles seem to be missing.
> +Threaded NAPI allows processing of packets from each NAPI in a kthread in
> +kernel. Threaded NAPI busy polling extends this and adds support to do
> +continuous busy polling of this NAPI. This can be used to enable busy polling
> +independent of userspace application or the API (epoll, io_uring, raw sockets)
> +being used in userspace to process the packets.
> +
> +It can be enabled for each NAPI using netlink interface.
Netlink, capital letter
> +For example, using following script:
> +
> +.. code-block:: bash
> +
> + $ ynl --family netdev --do napi-set \
> + --json='{"id": 66, "threaded": "busy-poll-enabled"}'
> +
> +
> +Enabling it for each NAPI allows finer control to enable busy pollling for
pollling -> polling
> +only a set of NIC queues which will get traffic with low latency requirements.
A bit of a non-sequitur. Sounds like you just cut off the device-wide
config here.
> +Depending on application requirement, user might want to set affinity of the
> +kthread that is busy polling each NAPI. User might also want to set priority
> +and the scheduler of the thread depending on the latency requirements.
> +
> +For a hard low-latency application, user might want to dedicate the full core
> +for the NAPI polling so the NIC queue descriptors are picked up from the queue
> +as soon as they appear. Once enabled, the NAPI thread will poll the NIC queues
> +continuously without sleeping. This will keep the CPU core busy with 100%
> +usage. For more relaxed low-latency requirement, user might want to share the
> +core with other threads by setting thread affinity and priority.
Is there such a thing as priority in the Linux scheduler? Being more
specific would be useful. I think this code is useful for forwarding
or AF_XDP, but normal socket applications would struggle to use this
mode.
> Threaded NAPI
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f3a3b761abfb..a88f6596aef7 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -427,6 +427,8 @@ enum {
> NAPI_STATE_THREADED, /* The poll is performed inside its own thread*/
> NAPI_STATE_SCHED_THREADED, /* Napi is currently scheduled in threaded mode */
> NAPI_STATE_HAS_NOTIFIER, /* Napi has an IRQ notifier */
> + NAPI_STATE_THREADED_BUSY_POLL, /* The threaded napi poller will busy poll */
> + NAPI_STATE_SCHED_THREADED_BUSY_POLL, /* The threaded napi poller is busy polling */
I don't get why you need 2 bits to implement this feature.
> @@ -1873,7 +1881,8 @@ enum netdev_reg_state {
> * @addr_len: Hardware address length
> * @upper_level: Maximum depth level of upper devices.
> * @lower_level: Maximum depth level of lower devices.
> - * @threaded: napi threaded state.
> + * @threaded: napi threaded mode is disabled, enabled or
> + * enabled with busy polling.
And you are still updating the device level kdoc.
> * @neigh_priv_len: Used in neigh_alloc()
> * @dev_id: Used to differentiate devices that share
> * the same link layer address
> @@ -78,6 +78,7 @@
> #include <linux/slab.h>
> #include <linux/sched.h>
> #include <linux/sched/isolation.h>
> +#include <linux/sched/types.h>
Leftover from experiments with setting scheduler params in the core?
Or dare I say Google prod kernel?
> #include <linux/sched/mm.h>
> #include <linux/smpboot.h>
> #include <linux/mutex.h>
> @@ -6619,7 +6620,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> * the guarantee we will be called later.
> */
> if (unlikely(n->state & (NAPIF_STATE_NPSVC |
> - NAPIF_STATE_IN_BUSY_POLL)))
> + NAPIF_STATE_IN_BUSY_POLL |
> + NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
Why not just set the IN_BUSY_POLL when the thread starts polling?
What's the significance of the distinction?
> return false;
>
> if (work_done) {
> +static void napi_set_threaded_state(struct napi_struct *napi,
> + enum netdev_napi_threaded threaded)
> +{
> + unsigned long val;
> +
> + val = 0;
> + if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
> + val |= NAPIF_STATE_THREADED_BUSY_POLL;
> + if (threaded)
> + val |= NAPIF_STATE_THREADED;
this reads odd, set threaded first then the sub-option
> + set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
Does this actually have to be atomic? I don't think so.
> +}
> +
> int napi_set_threaded(struct napi_struct *napi,
> enum netdev_napi_threaded threaded)
> {
> @@ -7050,7 +7066,7 @@ int napi_set_threaded(struct napi_struct *napi,
> } else {
> /* Make sure kthread is created before THREADED bit is set. */
> smp_mb__before_atomic();
> - assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> + napi_set_threaded_state(napi, threaded);
> }
>
> return 0;
> @@ -7442,7 +7458,9 @@ void napi_disable_locked(struct napi_struct *n)
> }
>
> new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
> - new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
> + new &= ~(NAPIF_STATE_THREADED
> + | NAPIF_STATE_THREADED_BUSY_POLL
> + | NAPIF_STATE_PREFER_BUSY_POLL);
kernel coding style has | at the end of the line.
> } while (!try_cmpxchg(&n->state, &val, new));
>
> hrtimer_cancel(&n->timer);
> @@ -7486,7 +7504,7 @@ void napi_enable_locked(struct napi_struct *n)
>
> new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
> if (n->dev->threaded && n->thread)
> - new |= NAPIF_STATE_THREADED;
> + napi_set_threaded_state(n, n->dev->threaded);
> } while (!try_cmpxchg(&n->state, &val, new));
> }
> EXPORT_SYMBOL(napi_enable_locked);
> @@ -7654,7 +7672,7 @@ static int napi_thread_wait(struct napi_struct *napi)
> return -1;
> }
>
> -static void napi_threaded_poll_loop(struct napi_struct *napi)
> +static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
> {
> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> struct softnet_data *sd;
> @@ -7683,22 +7701,58 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
> }
> skb_defer_free_flush(sd);
> bpf_net_ctx_clear(bpf_net_ctx);
> +
> + /* Flush too old packets. If HZ < 1000, flush all packets */
Probably better to say something about the condition than just copying
the comment a third time.
> + if (busy_poll)
> + gro_flush_normal(&napi->gro, HZ >= 1000);
> local_bh_enable();
>
> - if (!repoll)
> + /* If busy polling then do not break here because we need to
> + * call cond_resched and rcu_softirq_qs_periodic to prevent
> + * watchdog warnings.
> + */
> + if (!repoll && !busy_poll)
> break;
>
> rcu_softirq_qs_periodic(last_qs);
> cond_resched();
> +
> + if (!repoll)
> + break;
> }
> }
>
> static int napi_threaded_poll(void *data)
> {
> struct napi_struct *napi = data;
> + bool busy_poll_sched;
> + unsigned long val;
> + bool busy_poll;
> +
> + while (!napi_thread_wait(napi)) {
> + /* Once woken up, this means that we are scheduled as threaded
> + * napi and this thread owns the napi context, if busy poll
please capitalize NAPI in all comments and docs
> + * state is set then busy poll this napi.
> + */
> + val = READ_ONCE(napi->state);
> + busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
> + busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
> +
> + /* Do not busy poll if napi is disabled. */
It's not disabled, disable is pending
> + if (unlikely(val & NAPIF_STATE_DISABLE))
> + busy_poll = false;
> +
> + if (busy_poll != busy_poll_sched) {
> + val = busy_poll ?
> + NAPIF_STATE_SCHED_THREADED_BUSY_POLL :
> + 0;
> + set_mask_bits(&napi->state,
> + NAPIF_STATE_SCHED_THREADED_BUSY_POLL,
> + val);
why set single bit with set_mask_bits() ?
Please improve. IIRC it took quite a lot of reviewer effort to get
the thread stopping into shape; a similar level of effort will not be
afforded again.