Message-ID: <CAAywjhR1yHShH_LdMW+s-L+Luv=jpWG3kr3eSu1h5w6tWFMCHw@mail.gmail.com>
Date: Mon, 15 Sep 2025 09:24:13 -0700
From: Samiullah Khawaja <skhawaja@...gle.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: "David S . Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, willemb@...gle.com, 
	Joe Damato <joe@...a.to>, mkarsten@...terloo.ca, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v9 1/2] Extend napi threaded polling to allow
 kthread based busy polling

On Fri, Sep 12, 2025 at 6:46 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Thu, 11 Sep 2025 21:29:00 +0000 Samiullah Khawaja wrote:
> > Add a new state to napi state enum:
> >
> > - NAPI_STATE_THREADED_BUSY_POLL
> >   Threaded busy poll is enabled/running for this napi.
> >
> > Following changes are introduced in the napi scheduling and state logic:
> >
> > - When threaded busy poll is enabled through netlink it also enables
> >   NAPI_STATE_THREADED so a kthread is created per napi. It also sets
> >   NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that it is
> >   going to busy poll the napi.
> >
> > - When napi is scheduled with NAPI_STATE_SCHED_THREADED and associated
> >   kthread is woken up, the kthread owns the context. If
> >   NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED both are
> >   set then it means that kthread can busy poll.
> >
> > - To keep busy polling and to avoid scheduling of the interrupts, the
> >   napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED
> >   and NAPI_STATE_THREADED_BUSY_POLL flags are set. Also
> >   napi_complete_done returns early to avoid the
> >   NAPI_STATE_SCHED_THREADED being unset.
> >
> > - If at any point NAPI_STATE_THREADED_BUSY_POLL is unset, the
> >   napi_complete_done will run and unset the NAPI_STATE_SCHED_THREADED
> >   bit also. This will make the associated kthread go to sleep as per
> >   existing logic.
> >
> > Signed-off-by: Samiullah Khawaja <skhawaja@...gle.com>
>
> I think you need to spend some time trying to make this code more..
> elegant.
Thanks for the review and the feedback. I will work on it for the next revision.
>
> > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> > index c035dc0f64fd..ce28e8708a87 100644
> > --- a/Documentation/netlink/specs/netdev.yaml
> > +++ b/Documentation/netlink/specs/netdev.yaml
> > @@ -88,7 +88,7 @@ definitions:
> >    -
> >      name: napi-threaded
> >      type: enum
> > -    entries: [disabled, enabled]
> > +    entries: [disabled, enabled, busy-poll-enabled]
>
> drop the -enabled
Agreed
>
> >  attribute-sets:
> >    -
> > @@ -291,7 +291,8 @@ attribute-sets:
> >          name: threaded
> >          doc: Whether the NAPI is configured to operate in threaded polling
> >               mode. If this is set to enabled then the NAPI context operates
> > -             in threaded polling mode.
> > +             in threaded polling mode. If this is set to busy-poll-enabled
> > +             then the NAPI kthread also does busypolling.
>
> I don't think busypolling is a word? I mean, I don't think English
> combines words like this.
>
> > +Threaded NAPI busy polling
> > +--------------------------
>
> Please feed the documentation into a grammar checker. A bunch of
> articles seems to be missing.
>
> > +Threaded NAPI allows processing of packets from each NAPI in a kthread in
> > +kernel. Threaded NAPI busy polling extends this and adds support to do
> > +continuous busy polling of this NAPI. This can be used to enable busy polling
> > +independent of userspace application or the API (epoll, io_uring, raw sockets)
> > +being used in userspace to process the packets.
> > +
> > +It can be enabled for each NAPI using netlink interface.
>
> Netlink, capital letter
>
> > +For example, using following script:
> > +
> > +.. code-block:: bash
> > +
> > +  $ ynl --family netdev --do napi-set \
> > +            --json='{"id": 66, "threaded": "busy-poll-enabled"}'
> > +
> > +
> > +Enabling it for each NAPI allows finer control to enable busy pollling for
>
> pollling -> polling
>
> > +only a set of NIC queues which will get traffic with low latency requirements.
>
> A bit of a non-sequitur. Sounds like you just cut off the device-wide
> config here.
Will reword this.
>
> > +Depending on application requirement, user might want to set affinity of the
> > +kthread that is busy polling each NAPI. User might also want to set priority
> > +and the scheduler of the thread depending on the latency requirements.
> > +
> > +For a hard low-latency application, user might want to dedicate the full core
> > +for the NAPI polling so the NIC queue descriptors are picked up from the queue
> > +as soon as they appear. Once enabled, the NAPI thread will poll the NIC queues
> > +continuously without sleeping. This will keep the CPU core busy with 100%
> > +usage. For more relaxed low-latency requirement, user might want to share the
> > +core with other threads by setting thread affinity and priority.
>
> Is there such a thing as priority in the Linux scheduler? Being more
> specific would be useful. I think this code is useful for forwarding
> or AF_XDP but normal socket applications would struggle to use this
> mode.
Will reword this.
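To make "priority" concrete I can point the doc at an RT policy plus
affinity. A rough userspace sketch (hypothetical values: CPU 3,
SCHED_FIFO priority 50; the PID of the napi/<dev>-<id> kthread is
passed in):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	struct sched_param sp = { .sched_priority = 50 };
	cpu_set_t set;
	pid_t pid;

	if (argc < 2)
		return 1;
	pid = atoi(argv[1]);	/* PID of the NAPI kthread */

	CPU_ZERO(&set);
	CPU_SET(3, &set);	/* dedicate this core to the poller */
	if (sched_setaffinity(pid, sizeof(set), &set))
		perror("sched_setaffinity");
	if (sched_setscheduler(pid, SCHED_FIFO, &sp))
		perror("sched_setscheduler");
	return 0;
}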
>
> >  Threaded NAPI
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index f3a3b761abfb..a88f6596aef7 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -427,6 +427,8 @@ enum {
> >       NAPI_STATE_THREADED,            /* The poll is performed inside its own thread*/
> >       NAPI_STATE_SCHED_THREADED,      /* Napi is currently scheduled in threaded mode */
> >       NAPI_STATE_HAS_NOTIFIER,        /* Napi has an IRQ notifier */
> > +     NAPI_STATE_THREADED_BUSY_POLL,  /* The threaded napi poller will busy poll */
> > +     NAPI_STATE_SCHED_THREADED_BUSY_POLL,  /* The threaded napi poller is busy polling */
>
> I don't get why you need 2 bits to implement this feature.
I will reuse the IN_BUSY_POLL bit as you suggested below.
>
> > @@ -1873,7 +1881,8 @@ enum netdev_reg_state {
> >   *   @addr_len:              Hardware address length
> >   *   @upper_level:           Maximum depth level of upper devices.
> >   *   @lower_level:           Maximum depth level of lower devices.
> > - *   @threaded:              napi threaded state.
> > + *   @threaded:              napi threaded mode is disabled, enabled or
> > + *                           enabled with busy polling.
>
> And you are still updating the device level kdoc.
>
> >   *   @neigh_priv_len:        Used in neigh_alloc()
> >   *   @dev_id:                Used to differentiate devices that share
> >   *                           the same link layer address
>
> > @@ -78,6 +78,7 @@
> >  #include <linux/slab.h>
> >  #include <linux/sched.h>
> >  #include <linux/sched/isolation.h>
> > +#include <linux/sched/types.h>
>
> Leftover from experiments with setting scheduler params in the core?
> Or dare I say Google prod kernel?
>
> >  #include <linux/sched/mm.h>
> >  #include <linux/smpboot.h>
> >  #include <linux/mutex.h>
> > @@ -6619,7 +6620,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> >        *    the guarantee we will be called later.
> >        */
> >       if (unlikely(n->state & (NAPIF_STATE_NPSVC |
> > -                              NAPIF_STATE_IN_BUSY_POLL)))
> > +                              NAPIF_STATE_IN_BUSY_POLL |
> > +                              NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
>
> Why not just set the IN_BUSY_POLL when the thread starts polling?
> What's the significance of the distinction?
Agreed. I think I can reuse the IN_BUSY_POLL bit.
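Something like this in the poll loop, as a sketch (reusing the bit that
napi_busy_loop() already sets, so the early return above needs no new
flag):

	/* Mark the busy-poll window with the existing bit so the
	 * kthread keeps ownership of the NAPI context.
	 */
	set_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
	napi_threaded_poll_loop(napi, true);
	clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);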
>
> >               return false;
> >
> >       if (work_done) {
>
> > +static void napi_set_threaded_state(struct napi_struct *napi,
> > +                                 enum netdev_napi_threaded threaded)
> > +{
> > +     unsigned long val;
> > +
> > +     val = 0;
> > +     if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
> > +             val |= NAPIF_STATE_THREADED_BUSY_POLL;
> > +     if (threaded)
> > +             val |= NAPIF_STATE_THREADED;
>
> this reads odd: set threaded first, then the sub-option
>
> > +     set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
>
> Does this actually have to be atomic? I don't think so.
Agreed.
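Then the helper becomes a plain read-modify-write, which also covers
the ordering nit above (THREADED first, then the sub-option). A sketch,
assuming the instance lock serializes writers here:

static void napi_set_threaded_state(struct napi_struct *napi,
				    enum netdev_napi_threaded threaded)
{
	unsigned long val = READ_ONCE(napi->state);

	val &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_THREADED_BUSY_POLL);
	if (threaded)
		val |= NAPIF_STATE_THREADED;
	if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
		val |= NAPIF_STATE_THREADED_BUSY_POLL;
	WRITE_ONCE(napi->state, val);
}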
>
> > +}
> > +
> >  int napi_set_threaded(struct napi_struct *napi,
> >                     enum netdev_napi_threaded threaded)
> >  {
> > @@ -7050,7 +7066,7 @@ int napi_set_threaded(struct napi_struct *napi,
> >       } else {
> >               /* Make sure kthread is created before THREADED bit is set. */
> >               smp_mb__before_atomic();
> > -             assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> > +             napi_set_threaded_state(napi, threaded);
> >       }
> >
> >       return 0;
> > @@ -7442,7 +7458,9 @@ void napi_disable_locked(struct napi_struct *n)
> >               }
> >
> >               new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
> > -             new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
> > +             new &= ~(NAPIF_STATE_THREADED
> > +                      | NAPIF_STATE_THREADED_BUSY_POLL
> > +                      | NAPIF_STATE_PREFER_BUSY_POLL);
>
> kernel coding style has | at the end of the line.
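Ack, i.e.:

		new &= ~(NAPIF_STATE_THREADED |
			 NAPIF_STATE_THREADED_BUSY_POLL |
			 NAPIF_STATE_PREFER_BUSY_POLL);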
>
> >       } while (!try_cmpxchg(&n->state, &val, new));
> >
> >       hrtimer_cancel(&n->timer);
> > @@ -7486,7 +7504,7 @@ void napi_enable_locked(struct napi_struct *n)
> >
> >               new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
> >               if (n->dev->threaded && n->thread)
> > -                     new |= NAPIF_STATE_THREADED;
> > +                     napi_set_threaded_state(n, n->dev->threaded);
> >       } while (!try_cmpxchg(&n->state, &val, new));
> >  }
> >  EXPORT_SYMBOL(napi_enable_locked);
> > @@ -7654,7 +7672,7 @@ static int napi_thread_wait(struct napi_struct *napi)
> >       return -1;
> >  }
> >
> > -static void napi_threaded_poll_loop(struct napi_struct *napi)
> > +static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
> >  {
> >       struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> >       struct softnet_data *sd;
> > @@ -7683,22 +7701,58 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
> >               }
> >               skb_defer_free_flush(sd);
> >               bpf_net_ctx_clear(bpf_net_ctx);
> > +
> > +             /* Flush too old packets. If HZ < 1000, flush all packets */
>
> Probably better to say something about the condition than just copy
> the comment a third time.
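My reading of why the flush is needed here, as a comment sketch:

		/* The busy-poll loop never reaches napi_complete_done(),
		 * so flush aged GRO packets here. With HZ < 1000 a jiffy
		 * is too coarse for aging, so flush everything.
		 */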
>
> > +             if (busy_poll)
> > +                     gro_flush_normal(&napi->gro, HZ >= 1000);
> >               local_bh_enable();
> >
> > -             if (!repoll)
> > +             /* If busy polling then do not break here because we need to
> > +              * call cond_resched and rcu_softirq_qs_periodic to prevent
> > +              * watchdog warnings.
> > +              */
> > +             if (!repoll && !busy_poll)
> >                       break;
> >
> >               rcu_softirq_qs_periodic(last_qs);
> >               cond_resched();
> > +
> > +             if (!repoll)
> > +                     break;
> >       }
> >  }
> >
> >  static int napi_threaded_poll(void *data)
> >  {
> >       struct napi_struct *napi = data;
> > +     bool busy_poll_sched;
> > +     unsigned long val;
> > +     bool busy_poll;
> > +
> > +     while (!napi_thread_wait(napi)) {
> > +             /* Once woken up, this means that we are scheduled as threaded
> > +              * napi and this thread owns the napi context, if busy poll
>
> please capitalize NAPI in all comments and docs
>
> > +              * state is set then busy poll this napi.
> > +              */
> > +             val = READ_ONCE(napi->state);
> > +             busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
> > +             busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
> > +
> > +             /* Do not busy poll if napi is disabled. */
>
> It's not disabled; a disable is pending
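Will fix the comment, e.g.:

		/* Do not busy poll if a napi disable is pending. */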
>
> > +             if (unlikely(val & NAPIF_STATE_DISABLE))
>
> > +                     busy_poll = false;
> > +
> > +             if (busy_poll != busy_poll_sched) {
> > +                     val = busy_poll ?
> > +                                     NAPIF_STATE_SCHED_THREADED_BUSY_POLL :
> > +                                     0;
> > +                     set_mask_bits(&napi->state,
> > +                                   NAPIF_STATE_SCHED_THREADED_BUSY_POLL,
> > +                                   val);
>
> why set a single bit with set_mask_bits()?
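Right, a plain assign_bit() would do (sketch only; this likely
disappears entirely once the bit is replaced by IN_BUSY_POLL as above):

			assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL,
				   &napi->state, busy_poll);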
>
> Please improve, IIRC it took quite a lot of reviewer effort to get
> the thread stopping into shape; a similar level of effort will not be
> afforded again.
