Message-ID: <CAAywjhR1yHShH_LdMW+s-L+Luv=jpWG3kr3eSu1h5w6tWFMCHw@mail.gmail.com>
Date: Mon, 15 Sep 2025 09:24:13 -0700
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: "David S . Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, willemb@...gle.com,
Joe Damato <joe@...a.to>, mkarsten@...terloo.ca, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v9 1/2] Extend napi threaded polling to allow
kthread based busy polling
On Fri, Sep 12, 2025 at 6:46 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Thu, 11 Sep 2025 21:29:00 +0000 Samiullah Khawaja wrote:
> > Add a new state to napi state enum:
> >
> > - NAPI_STATE_THREADED_BUSY_POLL
> > Threaded busy poll is enabled/running for this napi.
> >
> > Following changes are introduced in the napi scheduling and state logic:
> >
> > - When threaded busy poll is enabled through netlink it also enables
> > NAPI_STATE_THREADED so a kthread is created per napi. It also sets
> > NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that it is
> > going to busy poll the napi.
> >
> > - When napi is scheduled with NAPI_STATE_SCHED_THREADED and associated
> > kthread is woken up, the kthread owns the context. If
> > NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED both are
> > set then it means that kthread can busy poll.
> >
> > - To keep busy polling and to avoid scheduling of the interrupts, the
> > napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED
> > and NAPI_STATE_THREADED_BUSY_POLL flags are set. Also
> > napi_complete_done returns early to avoid the
> > NAPI_STATE_SCHED_THREADED being unset.
> >
> > - If at any point NAPI_STATE_THREADED_BUSY_POLL is unset, the
> > napi_complete_done will run and unset the NAPI_STATE_SCHED_THREADED
> > bit also. This will make the associated kthread go to sleep as per
> > existing logic.
> >
> > Signed-off-by: Samiullah Khawaja <skhawaja@...gle.com>
>
> I think you need to spend some time trying to make this code more..
> elegant.
Thanks for the review and the feedback. I will work on it for the next revision.
>
> > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> > index c035dc0f64fd..ce28e8708a87 100644
> > --- a/Documentation/netlink/specs/netdev.yaml
> > +++ b/Documentation/netlink/specs/netdev.yaml
> > @@ -88,7 +88,7 @@ definitions:
> > -
> > name: napi-threaded
> > type: enum
> > - entries: [disabled, enabled]
> > + entries: [disabled, enabled, busy-poll-enabled]
>
> drop the -enabled
Agreed
>
> > attribute-sets:
> > -
> > @@ -291,7 +291,8 @@ attribute-sets:
> > name: threaded
> > doc: Whether the NAPI is configured to operate in threaded polling
> > mode. If this is set to enabled then the NAPI context operates
> > - in threaded polling mode.
> > + in threaded polling mode. If this is set to busy-poll-enabled
> > + then the NAPI kthread also does busypolling.
>
> I don't think busypolling is a word? I mean, I don't think English
> combines words like this.
>
> > +Threaded NAPI busy polling
> > +--------------------------
>
> Please feed the documentation into a grammar checker. A bunch of
> articles seems to be missing.
>
> > +Threaded NAPI allows processing of packets from each NAPI in a kthread in
> > +kernel. Threaded NAPI busy polling extends this and adds support to do
> > +continuous busy polling of this NAPI. This can be used to enable busy polling
> > +independent of userspace application or the API (epoll, io_uring, raw sockets)
> > +being used in userspace to process the packets.
> > +
> > +It can be enabled for each NAPI using netlink interface.
>
> Netlink, capital letter
>
> > +For example, using following script:
> > +
> > +.. code-block:: bash
> > +
> > + $ ynl --family netdev --do napi-set \
> > + --json='{"id": 66, "threaded": "busy-poll-enabled"}'
> > +
> > +
> > +Enabling it for each NAPI allows finer control to enable busy pollling for
>
> pollling -> polling
>
> > +only a set of NIC queues which will get traffic with low latency requirements.
>
> A bit of a non-sequitur. Sounds like you just cut off the device-wide
> config here.
Will reword this.
>
> > +Depending on application requirement, user might want to set affinity of the
> > +kthread that is busy polling each NAPI. User might also want to set priority
> > +and the scheduler of the thread depending on the latency requirements.
> > +
> > +For a hard low-latency application, user might want to dedicate the full core
> > +for the NAPI polling so the NIC queue descriptors are picked up from the queue
> > +as soon as they appear. Once enabled, the NAPI thread will poll the NIC queues
> > +continuously without sleeping. This will keep the CPU core busy with 100%
> > +usage. For more relaxed low-latency requirement, user might want to share the
> > +core with other threads by setting thread affinity and priority.
>
> Is there such a thing as priority in the Linux scheduler? Being more
> specific would be useful. I think this code is useful for forwarding
> or AF_XDP but normal socket applications would struggle to use this
> mode.
Will reword this.
>
> > Threaded NAPI
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index f3a3b761abfb..a88f6596aef7 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -427,6 +427,8 @@ enum {
> > NAPI_STATE_THREADED, /* The poll is performed inside its own thread*/
> > NAPI_STATE_SCHED_THREADED, /* Napi is currently scheduled in threaded mode */
> > NAPI_STATE_HAS_NOTIFIER, /* Napi has an IRQ notifier */
> > + NAPI_STATE_THREADED_BUSY_POLL, /* The threaded napi poller will busy poll */
> > + NAPI_STATE_SCHED_THREADED_BUSY_POLL, /* The threaded napi poller is busy polling */
>
> I don't get why you need 2 bits to implement this feature.
I will reuse the IN_BUSY_POLL bit as you suggested below.
>
> > @@ -1873,7 +1881,8 @@ enum netdev_reg_state {
> > * @addr_len: Hardware address length
> > * @upper_level: Maximum depth level of upper devices.
> > * @lower_level: Maximum depth level of lower devices.
> > - * @threaded: napi threaded state.
> > + * @threaded: napi threaded mode is disabled, enabled or
> > + * enabled with busy polling.
>
> And you are still updating the device level kdoc.
>
> > * @neigh_priv_len: Used in neigh_alloc()
> > * @dev_id: Used to differentiate devices that share
> > * the same link layer address
>
> > @@ -78,6 +78,7 @@
> > #include <linux/slab.h>
> > #include <linux/sched.h>
> > #include <linux/sched/isolation.h>
> > +#include <linux/sched/types.h>
>
> Leftover from experiments with setting scheduler params in the core?
> Or dare I say Google prod kernel?
>
> > #include <linux/sched/mm.h>
> > #include <linux/smpboot.h>
> > #include <linux/mutex.h>
> > @@ -6619,7 +6620,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> > * the guarantee we will be called later.
> > */
> > if (unlikely(n->state & (NAPIF_STATE_NPSVC |
> > - NAPIF_STATE_IN_BUSY_POLL)))
> > + NAPIF_STATE_IN_BUSY_POLL |
> > + NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
>
> Why not just set the IN_BUSY_POLL when the thread starts polling?
> What's the significance of the distinction?
Agreed. I think I can reuse the IN_BUSY_POLL bit.
>
> > return false;
> >
> > if (work_done) {
>
> > +static void napi_set_threaded_state(struct napi_struct *napi,
> > + enum netdev_napi_threaded threaded)
> > +{
> > + unsigned long val;
> > +
> > + val = 0;
> > + if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
> > + val |= NAPIF_STATE_THREADED_BUSY_POLL;
> > + if (threaded)
> > + val |= NAPIF_STATE_THREADED;
>
> this reads odd, set threaded first then the sub-option
>
> > + set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
>
> Does this actually have to be atomic? I don't think so.
Agreed.
>
> > +}
> > +
> > int napi_set_threaded(struct napi_struct *napi,
> > enum netdev_napi_threaded threaded)
> > {
> > @@ -7050,7 +7066,7 @@ int napi_set_threaded(struct napi_struct *napi,
> > } else {
> > /* Make sure kthread is created before THREADED bit is set. */
> > smp_mb__before_atomic();
> > - assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> > + napi_set_threaded_state(napi, threaded);
> > }
> >
> > return 0;
> > @@ -7442,7 +7458,9 @@ void napi_disable_locked(struct napi_struct *n)
> > }
> >
> > new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
> > - new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
> > + new &= ~(NAPIF_STATE_THREADED
> > + | NAPIF_STATE_THREADED_BUSY_POLL
> > + | NAPIF_STATE_PREFER_BUSY_POLL);
>
> kernel coding style has | at the end of the line.
>
> > } while (!try_cmpxchg(&n->state, &val, new));
> >
> > hrtimer_cancel(&n->timer);
> > @@ -7486,7 +7504,7 @@ void napi_enable_locked(struct napi_struct *n)
> >
> > new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
> > if (n->dev->threaded && n->thread)
> > - new |= NAPIF_STATE_THREADED;
> > + napi_set_threaded_state(n, n->dev->threaded);
> > } while (!try_cmpxchg(&n->state, &val, new));
> > }
> > EXPORT_SYMBOL(napi_enable_locked);
> > @@ -7654,7 +7672,7 @@ static int napi_thread_wait(struct napi_struct *napi)
> > return -1;
> > }
> >
> > -static void napi_threaded_poll_loop(struct napi_struct *napi)
> > +static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
> > {
> > struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> > struct softnet_data *sd;
> > @@ -7683,22 +7701,58 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
> > }
> > skb_defer_free_flush(sd);
> > bpf_net_ctx_clear(bpf_net_ctx);
> > +
> > + /* Flush too old packets. If HZ < 1000, flush all packets */
>
> Probably better to say something about the condition than just copy
> the comment a third time.
>
> > + if (busy_poll)
> > + gro_flush_normal(&napi->gro, HZ >= 1000);
> > local_bh_enable();
> >
> > - if (!repoll)
> > + /* If busy polling then do not break here because we need to
> > + * call cond_resched and rcu_softirq_qs_periodic to prevent
> > + * watchdog warnings.
> > + */
> > + if (!repoll && !busy_poll)
> > break;
> >
> > rcu_softirq_qs_periodic(last_qs);
> > cond_resched();
> > +
> > + if (!repoll)
> > + break;
> > }
> > }
> >
> > static int napi_threaded_poll(void *data)
> > {
> > struct napi_struct *napi = data;
> > + bool busy_poll_sched;
> > + unsigned long val;
> > + bool busy_poll;
> > +
> > + while (!napi_thread_wait(napi)) {
> > + /* Once woken up, this means that we are scheduled as threaded
> > + * napi and this thread owns the napi context, if busy poll
>
> please capitalize NAPI in all comments and docs
>
> > + * state is set then busy poll this napi.
> > + */
> > + val = READ_ONCE(napi->state);
> > + busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
> > + busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
> > +
> > + /* Do not busy poll if napi is disabled. */
>
> It's not disabled, disable is pending
>
> > + if (unlikely(val & NAPIF_STATE_DISABLE))
>
> > + busy_poll = false;
> > +
> > + if (busy_poll != busy_poll_sched) {
> > + val = busy_poll ?
> > + NAPIF_STATE_SCHED_THREADED_BUSY_POLL :
> > + 0;
> > + set_mask_bits(&napi->state,
> > + NAPIF_STATE_SCHED_THREADED_BUSY_POLL,
> > + val);
>
> why set single bit with set_mask_bits() ?
>
> Please improve. IIRC it took quite a lot of reviewer effort to get
> the thread stopping into shape; a similar level of effort will not be
> afforded again.