Message-ID: <20250912184608.6c6a7c51@kernel.org>
Date: Fri, 12 Sep 2025 18:46:08 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: "David S . Miller " <davem@...emloft.net>, Eric Dumazet
 <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
 almasrymina@...gle.com, willemb@...gle.com, Joe Damato <joe@...a.to>,
 mkarsten@...terloo.ca, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v9 1/2] Extend napi threaded polling to allow
 kthread based busy polling

On Thu, 11 Sep 2025 21:29:00 +0000 Samiullah Khawaja wrote:
> Add a new state to napi state enum:
> 
> - NAPI_STATE_THREADED_BUSY_POLL
>   Threaded busy poll is enabled/running for this napi.
> 
> Following changes are introduced in the napi scheduling and state logic:
> 
> - When threaded busy poll is enabled through netlink it also enables
>   NAPI_STATE_THREADED so a kthread is created per napi. It also sets
>   NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that it is
>   going to busy poll the napi.
> 
> - When napi is scheduled with NAPI_STATE_SCHED_THREADED and associated
>   kthread is woken up, the kthread owns the context. If
>   NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED both are
>   set then it means that kthread can busy poll.
> 
> - To keep busy polling and to avoid scheduling of the interrupts, the
>   napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED
>   and NAPI_STATE_THREADED_BUSY_POLL flags are set. Also
>   napi_complete_done returns early to avoid the
>   NAPI_STATE_SCHED_THREADED being unset.
> 
> - If at any point NAPI_STATE_THREADED_BUSY_POLL is unset, the
>   napi_complete_done will run and unset the NAPI_STATE_SCHED_THREADED
>   bit also. This will make the associated kthread go to sleep as per
>   existing logic.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@...gle.com>

I think you need to spend some time trying to make this code more...
elegant.
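
To make sure I'm reading the state machine right, here's my rough
paraphrase of the proposed kthread flow (not a code suggestion, names
as in the patch):

	while (!napi_thread_wait(napi)) {
		bool busy = READ_ONCE(napi->state) &
			    NAPIF_STATE_THREADED_BUSY_POLL;

		/* busy: napi_complete_done() returns false early, so
		 * SCHED_THREADED stays set and the loop keeps polling;
		 * !busy: SCHED_THREADED gets cleared as today and the
		 * kthread goes back to sleep
		 */
		napi_threaded_poll_loop(napi, busy);
	}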

> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index c035dc0f64fd..ce28e8708a87 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -88,7 +88,7 @@ definitions:
>    -
>      name: napi-threaded
>      type: enum
> -    entries: [disabled, enabled]
> +    entries: [disabled, enabled, busy-poll-enabled]

drop the -enabled
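
i.e.:

    entries: [disabled, enabled, busy-poll]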

>  attribute-sets:
>    -
> @@ -291,7 +291,8 @@ attribute-sets:
>          name: threaded
>          doc: Whether the NAPI is configured to operate in threaded polling
>               mode. If this is set to enabled then the NAPI context operates
> -             in threaded polling mode.
> +             in threaded polling mode. If this is set to busy-poll-enabled
> +             then the NAPI kthread also does busypolling.

I don't think busypolling is a word? I mean, I don't think English
combines words like this.

> +Threaded NAPI busy polling
> +--------------------------

Please feed the documentation into a grammar checker. A bunch of
articles seem to be missing.

> +Threaded NAPI allows processing of packets from each NAPI in a kthread in
> +kernel. Threaded NAPI busy polling extends this and adds support to do
> +continuous busy polling of this NAPI. This can be used to enable busy polling
> +independent of userspace application or the API (epoll, io_uring, raw sockets)
> +being used in userspace to process the packets.
> +
> +It can be enabled for each NAPI using netlink interface.

Netlink, capital letter

> +For example, using following script:
> +
> +.. code-block:: bash
> +
> +  $ ynl --family netdev --do napi-set \
> +            --json='{"id": 66, "threaded": "busy-poll-enabled"}'
> +
> +
> +Enabling it for each NAPI allows finer control to enable busy pollling for

pollling -> polling

> +only a set of NIC queues which will get traffic with low latency requirements.

A bit of a non-sequitur. Sounds like you just cut off the device-wide
config here.

> +Depending on application requirement, user might want to set affinity of the
> +kthread that is busy polling each NAPI. User might also want to set priority
> +and the scheduler of the thread depending on the latency requirements.
> +
> +For a hard low-latency application, user might want to dedicate the full core
> +for the NAPI polling so the NIC queue descriptors are picked up from the queue
> +as soon as they appear. Once enabled, the NAPI thread will poll the NIC queues
> +continuously without sleeping. This will keep the CPU core busy with 100%
> +usage. For more relaxed low-latency requirement, user might want to share the
> +core with other threads by setting thread affinity and priority.

Is there such a thing as priority in the Linux scheduler? Being more
specific would be useful. I think this code is useful for forwarding
or AF_XDP but normal socket applications would struggle to use this
mode.
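
FWIW if the doc means scheduling policy and RT priority it should say
so, e.g. something the user would do from the shell ($NAPI_KTHREAD_PID
being a placeholder for the PID of the napi/<dev>-<id> kthread):

  # pin the NAPI kthread to CPU 2 and make it SCHED_FIFO prio 50
  taskset -pc 2 $NAPI_KTHREAD_PID
  chrt -f -p 50 $NAPI_KTHREAD_PID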

>  Threaded NAPI
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f3a3b761abfb..a88f6596aef7 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -427,6 +427,8 @@ enum {
>  	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
>  	NAPI_STATE_SCHED_THREADED,	/* Napi is currently scheduled in threaded mode */
>  	NAPI_STATE_HAS_NOTIFIER,	/* Napi has an IRQ notifier */
> +	NAPI_STATE_THREADED_BUSY_POLL,	/* The threaded napi poller will busy poll */
> +	NAPI_STATE_SCHED_THREADED_BUSY_POLL,  /* The threaded napi poller is busy polling */

I don't get why you need 2 bits to implement this feature.

> @@ -1873,7 +1881,8 @@ enum netdev_reg_state {
>   * 	@addr_len:		Hardware address length
>   *	@upper_level:		Maximum depth level of upper devices.
>   *	@lower_level:		Maximum depth level of lower devices.
> - *	@threaded:		napi threaded state.
> + *	@threaded:		napi threaded mode is disabled, enabled or
> + *				enabled with busy polling.

And you are still updating the device-level kdoc.

>   *	@neigh_priv_len:	Used in neigh_alloc()
>   * 	@dev_id:		Used to differentiate devices that share
>   * 				the same link layer address

> @@ -78,6 +78,7 @@
>  #include <linux/slab.h>
>  #include <linux/sched.h>
>  #include <linux/sched/isolation.h>
> +#include <linux/sched/types.h>

Leftover from experiments with setting scheduler params in the core?
Or dare I say Google prod kernel?

>  #include <linux/sched/mm.h>
>  #include <linux/smpboot.h>
>  #include <linux/mutex.h>
> @@ -6619,7 +6620,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>  	 *    the guarantee we will be called later.
>  	 */
>  	if (unlikely(n->state & (NAPIF_STATE_NPSVC |
> -				 NAPIF_STATE_IN_BUSY_POLL)))
> +				 NAPIF_STATE_IN_BUSY_POLL |
> +				 NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))

Why not just set the IN_BUSY_POLL when the thread starts polling?
What's the significance of the distinction?
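
i.e. in the thread, something like (completely untested):

	if (busy_poll)
		set_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
	napi_threaded_poll_loop(napi, busy_poll);
	if (busy_poll)
		clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);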

>  		return false;
>  
>  	if (work_done) {

> +static void napi_set_threaded_state(struct napi_struct *napi,
> +				    enum netdev_napi_threaded threaded)
> +{
> +	unsigned long val;
> +
> +	val = 0;
> +	if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
> +		val |= NAPIF_STATE_THREADED_BUSY_POLL;
> +	if (threaded)
> +		val |= NAPIF_STATE_THREADED;

this reads oddly, set THREADED first, then the sub-option
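
i.e.:

	val = 0;
	if (threaded)
		val |= NAPIF_STATE_THREADED;
	if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
		val |= NAPIF_STATE_THREADED_BUSY_POLL;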

> +	set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);

Does this actually have to be atomic? I don't think so.

> +}
> +
>  int napi_set_threaded(struct napi_struct *napi,
>  		      enum netdev_napi_threaded threaded)
>  {
> @@ -7050,7 +7066,7 @@ int napi_set_threaded(struct napi_struct *napi,
>  	} else {
>  		/* Make sure kthread is created before THREADED bit is set. */
>  		smp_mb__before_atomic();
> -		assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> +		napi_set_threaded_state(napi, threaded);
>  	}
>  
>  	return 0;
> @@ -7442,7 +7458,9 @@ void napi_disable_locked(struct napi_struct *n)
>  		}
>  
>  		new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
> -		new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
> +		new &= ~(NAPIF_STATE_THREADED
> +			 | NAPIF_STATE_THREADED_BUSY_POLL
> +			 | NAPIF_STATE_PREFER_BUSY_POLL);

kernel coding style has | at the end of the line.
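
i.e.:

	new &= ~(NAPIF_STATE_THREADED |
		 NAPIF_STATE_THREADED_BUSY_POLL |
		 NAPIF_STATE_PREFER_BUSY_POLL);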

>  	} while (!try_cmpxchg(&n->state, &val, new));
>  
>  	hrtimer_cancel(&n->timer);
> @@ -7486,7 +7504,7 @@ void napi_enable_locked(struct napi_struct *n)
>  
>  		new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
>  		if (n->dev->threaded && n->thread)
> -			new |= NAPIF_STATE_THREADED;
> +			napi_set_threaded_state(n, n->dev->threaded);
>  	} while (!try_cmpxchg(&n->state, &val, new));
>  }
>  EXPORT_SYMBOL(napi_enable_locked);
> @@ -7654,7 +7672,7 @@ static int napi_thread_wait(struct napi_struct *napi)
>  	return -1;
>  }
>  
> -static void napi_threaded_poll_loop(struct napi_struct *napi)
> +static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
>  {
>  	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>  	struct softnet_data *sd;
> @@ -7683,22 +7701,58 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
>  		}
>  		skb_defer_free_flush(sd);
>  		bpf_net_ctx_clear(bpf_net_ctx);
> +
> +		/* Flush too old packets. If HZ < 1000, flush all packets */

Probably better to say something about the condition than to just
copy the comment a third time.
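
Something along these lines would be more helpful (rough wording):

	/* With busy polling napi_complete_done() bails out early, so
	 * nothing else will flush GRO for us, do it here. If HZ <
	 * 1000, one jiffy is too coarse for ageing, flush everything.
	 */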

> +		if (busy_poll)
> +			gro_flush_normal(&napi->gro, HZ >= 1000);
>  		local_bh_enable();
>  
> -		if (!repoll)
> +		/* If busy polling then do not break here because we need to
> +		 * call cond_resched and rcu_softirq_qs_periodic to prevent
> +		 * watchdog warnings.
> +		 */
> +		if (!repoll && !busy_poll)
>  			break;
>  
>  		rcu_softirq_qs_periodic(last_qs);
>  		cond_resched();
> +
> +		if (!repoll)
> +			break;
>  	}
>  }
>  
>  static int napi_threaded_poll(void *data)
>  {
>  	struct napi_struct *napi = data;
> +	bool busy_poll_sched;
> +	unsigned long val;
> +	bool busy_poll;
> +
> +	while (!napi_thread_wait(napi)) {
> +		/* Once woken up, this means that we are scheduled as threaded
> +		 * napi and this thread owns the napi context, if busy poll

please capitalize NAPI in all comments and docs

> +		 * state is set then busy poll this napi.
> +		 */
> +		val = READ_ONCE(napi->state);
> +		busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
> +		busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
> +
> +		/* Do not busy poll if napi is disabled. */

It's not disabled, the disable is pending

> +		if (unlikely(val & NAPIF_STATE_DISABLE))

> +			busy_poll = false;
> +
> +		if (busy_poll != busy_poll_sched) {
> +			val = busy_poll ?
> +					NAPIF_STATE_SCHED_THREADED_BUSY_POLL :
> +					0;
> +			set_mask_bits(&napi->state,
> +				      NAPIF_STATE_SCHED_THREADED_BUSY_POLL,
> +				      val);

why set single bit with set_mask_bits() ?
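
assign_bit() would do, e.g. (untested):

	assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL, &napi->state,
		   busy_poll);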

Please improve; IIRC it took quite a lot of reviewer effort to get
the thread stopping into shape, and a similar level of effort will
not be afforded again.
