Message-ID: <dd41a7d6-892f-47a9-8ac7-80d7ac8b5d7f@intel.com>
Date: Thu, 2 Jan 2025 13:16:24 -0800
From: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
To: Samiullah Khawaja <skhawaja@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
"David S . Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>
CC: <netdev@...r.kernel.org>
Subject: Re: [PATCH net-next 3/3] Extend napi threaded polling to allow
kthread based busy polling
On 1/2/2025 11:12 AM, Samiullah Khawaja wrote:
> Add a new state to napi state enum:
>
> - STATE_THREADED_BUSY_POLL
> Threaded busy poll is enabled/running for this napi.
>
> Following changes are introduced in the napi scheduling and state logic:
>
> - When threaded busy poll is enabled through sysfs it also enables
> NAPI_STATE_THREADED so a kthread is created per napi. It also sets
> NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that we are
> supposed to busy poll for each napi.
Looks like this patch changes the sysfs 'threaded' field from a boolean
to an integer, with the value 2 indicating threaded mode with busy poll.
So I think the comment above should reflect that instead of just saying
that both threaded mode and busy poll are enabled.
>
> - When napi is scheduled with STATE_SCHED_THREADED and associated
> kthread is woken up, the kthread owns the context. If
> NAPI_STATE_THREADED_BUSY_POLL and NAPI_SCHED_THREADED both are set
> then it means that we can busy poll.
>
> - To keep busy polling and to avoid scheduling of the interrupts, the
> napi_complete_done returns false when both SCHED_THREADED and
> THREADED_BUSY_POLL flags are set. Also napi_complete_done returns
> early to avoid the STATE_SCHED_THREADED being unset.
>
> - If at any point STATE_THREADED_BUSY_POLL is unset, the
> napi_complete_done will run and unset the SCHED_THREADED bit also.
> This will make the associated kthread go to sleep as per existing
> logic.
When does STATE_THREADED_BUSY_POLL get unset? Don't we need a timeout
value to come out of busypoll mode if there is no traffic?
>
> Signed-off-by: Samiullah Khawaja <skhawaja@...gle.com>
> Reviewed-by: Willem de Bruijn <willemb@...gle.com>
> ---
> Documentation/ABI/testing/sysfs-class-net | 3 +-
> Documentation/netlink/specs/netdev.yaml | 5 +-
> .../net/ethernet/atheros/atl1c/atl1c_main.c | 2 +-
> include/linux/netdevice.h | 24 +++++--
> net/core/dev.c | 72 ++++++++++++++++---
> net/core/net-sysfs.c | 2 +-
> net/core/netdev-genl-gen.c | 2 +-
> 7 files changed, 89 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
> index ebf21beba846..15d7d36a8294 100644
> --- a/Documentation/ABI/testing/sysfs-class-net
> +++ b/Documentation/ABI/testing/sysfs-class-net
> @@ -343,7 +343,7 @@ Date: Jan 2021
> KernelVersion: 5.12
> Contact: netdev@...r.kernel.org
> Description:
> - Boolean value to control the threaded mode per device. User could
> + Integer value to control the threaded mode per device. User could
> set this value to enable/disable threaded mode for all napi
> belonging to this device, without the need to do device up/down.
>
> @@ -351,4 +351,5 @@ Description:
> == ==================================
> 0 threaded mode disabled for this dev
> 1 threaded mode enabled for this dev
> + 2 threaded mode enabled, and busy polling enabled.
> == ==================================
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index aac343af7246..9c905243a1cc 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -272,10 +272,11 @@ attribute-sets:
> name: threaded
> doc: Whether the napi is configured to operate in threaded polling
> mode. If this is set to `1` then the NAPI context operates
> - in threaded polling mode.
> + in threaded polling mode. If this is set to `2` then the NAPI
> + kthread also does busypolling.
> type: u32
> checks:
> - max: 1
> + max: 2
> -
> name: queue
> attributes:
> diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> index c571614b1d50..a709cddcd292 100644
> --- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> @@ -2688,7 +2688,7 @@ static int atl1c_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> adapter->mii.mdio_write = atl1c_mdio_write;
> adapter->mii.phy_id_mask = 0x1f;
> adapter->mii.reg_num_mask = MDIO_CTRL_REG_MASK;
> - dev_set_threaded(netdev, true);
> + dev_set_threaded(netdev, DEV_NAPI_THREADED);
> for (i = 0; i < adapter->rx_queue_count; ++i)
> netif_napi_add(netdev, &adapter->rrd_ring[i].napi,
> atl1c_clean_rx);
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 8f531d528869..c384ffe0976e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -407,6 +407,8 @@ enum {
> NAPI_STATE_PREFER_BUSY_POLL, /* prefer busy-polling over softirq processing*/
> NAPI_STATE_THREADED, /* The poll is performed inside its own thread*/
> NAPI_STATE_SCHED_THREADED, /* Napi is currently scheduled in threaded mode */
> + NAPI_STATE_THREADED_BUSY_POLL, /* The threaded napi poller will busy poll */
> + NAPI_STATE_SCHED_THREADED_BUSY_POLL, /* The threaded napi poller is busy polling */
> };
>
> enum {
> @@ -420,8 +422,14 @@ enum {
> NAPIF_STATE_PREFER_BUSY_POLL = BIT(NAPI_STATE_PREFER_BUSY_POLL),
> NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
> NAPIF_STATE_SCHED_THREADED = BIT(NAPI_STATE_SCHED_THREADED),
> + NAPIF_STATE_THREADED_BUSY_POLL = BIT(NAPI_STATE_THREADED_BUSY_POLL),
> + NAPIF_STATE_SCHED_THREADED_BUSY_POLL
> + = BIT(NAPI_STATE_SCHED_THREADED_BUSY_POLL),
> };
>
> +#define NAPIF_STATE_THREADED_BUSY_POLL_MASK \
> + (NAPIF_STATE_THREADED | NAPIF_STATE_THREADED_BUSY_POLL)
> +
> enum gro_result {
> GRO_MERGED,
> GRO_MERGED_FREE,
> @@ -568,16 +576,24 @@ static inline bool napi_complete(struct napi_struct *n)
> return napi_complete_done(n, 0);
> }
>
> -int dev_set_threaded(struct net_device *dev, bool threaded);
> +enum napi_threaded_state {
> + NAPI_THREADED_OFF = 0,
> + NAPI_THREADED = 1,
> + NAPI_THREADED_BUSY_POLL = 2,
> + NAPI_THREADED_MAX = NAPI_THREADED_BUSY_POLL,
> +};
> +
> +int dev_set_threaded(struct net_device *dev, enum napi_threaded_state threaded);
>
> /*
> * napi_set_threaded - set napi threaded state
> * @napi: NAPI context
> - * @threaded: whether this napi does threaded polling
> + * @threaded: threading mode
> *
> * Return 0 on success and negative errno on failure.
> */
> -int napi_set_threaded(struct napi_struct *napi, bool threaded);
> +int napi_set_threaded(struct napi_struct *napi,
> + enum napi_threaded_state threaded);
>
> /**
> * napi_disable - prevent NAPI from scheduling
> @@ -2406,7 +2422,7 @@ struct net_device {
> struct sfp_bus *sfp_bus;
> struct lock_class_key *qdisc_tx_busylock;
> bool proto_down;
> - bool threaded;
> + u8 threaded;
>
> /* priv_flags_slow, ungrouped to save space */
> unsigned long see_all_hwtstamp_requests:1;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 762977a62da2..b6cd9474bdd3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -78,6 +78,7 @@
> #include <linux/slab.h>
> #include <linux/sched.h>
> #include <linux/sched/isolation.h>
> +#include <linux/sched/types.h>
> #include <linux/sched/mm.h>
> #include <linux/smpboot.h>
> #include <linux/mutex.h>
> @@ -6231,7 +6232,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> * the guarantee we will be called later.
> */
> if (unlikely(n->state & (NAPIF_STATE_NPSVC |
> - NAPIF_STATE_IN_BUSY_POLL)))
> + NAPIF_STATE_IN_BUSY_POLL |
> + NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
> return false;
>
> if (work_done) {
> @@ -6633,8 +6635,10 @@ static void init_gro_hash(struct napi_struct *napi)
> napi->gro_bitmask = 0;
> }
>
> -int napi_set_threaded(struct napi_struct *napi, bool threaded)
> +int napi_set_threaded(struct napi_struct *napi,
> + enum napi_threaded_state threaded)
> {
> + unsigned long val;
> if (napi->dev->threaded)
> return -EINVAL;
>
> @@ -6649,30 +6653,41 @@ int napi_set_threaded(struct napi_struct *napi, bool threaded)
>
> /* Make sure kthread is created before THREADED bit is set. */
> smp_mb__before_atomic();
> - assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> + val = 0;
> + if (threaded == NAPI_THREADED_BUSY_POLL)
> + val |= NAPIF_STATE_THREADED_BUSY_POLL;
> + if (threaded)
> + val |= NAPIF_STATE_THREADED;
> + set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
>
> return 0;
> }
>
> -int dev_set_threaded(struct net_device *dev, bool threaded)
> +int dev_set_threaded(struct net_device *dev, enum napi_threaded_state threaded)
> {
> struct napi_struct *napi;
> + unsigned long val;
> int err = 0;
>
> if (dev->threaded == threaded)
> return 0;
>
> + val = 0;
> if (threaded) {
> /* Check if threaded is set at napi level already */
> list_for_each_entry(napi, &dev->napi_list, dev_list)
> if (test_bit(NAPI_STATE_THREADED, &napi->state))
> return -EINVAL;
>
> + val |= NAPIF_STATE_THREADED;
> + if (threaded == NAPI_THREADED_BUSY_POLL)
> + val |= NAPIF_STATE_THREADED_BUSY_POLL;
> +
> list_for_each_entry(napi, &dev->napi_list, dev_list) {
> if (!napi->thread) {
> err = napi_kthread_create(napi);
> if (err) {
> - threaded = false;
> + threaded = NAPI_THREADED_OFF;
> break;
> }
> }
> @@ -6691,9 +6706,13 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
> * polled. In this case, the switch between threaded mode and
> * softirq mode will happen in the next round of napi_schedule().
> * This should not cause hiccups/stalls to the live traffic.
> + *
> + * Switch to busy_poll threaded napi will occur after the threaded
> + * napi is scheduled.
> */
> list_for_each_entry(napi, &dev->napi_list, dev_list)
> - assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> + set_mask_bits(&napi->state,
> + NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
>
> return err;
> }
> @@ -7007,7 +7026,7 @@ static int napi_thread_wait(struct napi_struct *napi)
> return -1;
> }
>
> -static void napi_threaded_poll_loop(struct napi_struct *napi)
> +static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
> {
> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> struct softnet_data *sd;
> @@ -7036,22 +7055,53 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
> }
> skb_defer_free_flush(sd);
> bpf_net_ctx_clear(bpf_net_ctx);
> +
> + /* Push the skbs up the stack if busy polling. */
> + if (busy_poll)
> + __napi_gro_flush_helper(napi);
> local_bh_enable();
>
> - if (!repoll)
> + /* If busy polling then do not break here because we need to
> + * call cond_resched and rcu_softirq_qs_periodic to prevent
> + * watchdog warnings.
> + */
> + if (!repoll && !busy_poll)
> break;
>
> rcu_softirq_qs_periodic(last_qs);
> cond_resched();
> +
> + if (!repoll)
> + break;
> }
> }
>
> static int napi_threaded_poll(void *data)
> {
> struct napi_struct *napi = data;
> + bool busy_poll_sched;
> + unsigned long val;
> + bool busy_poll;
> +
> + while (!napi_thread_wait(napi)) {
> + /* Once woken up, this means that we are scheduled as threaded
> + * napi and this thread owns the napi context, if busy poll
> + * state is set then we busy poll this napi.
> + */
> + val = READ_ONCE(napi->state);
> + busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
> + busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
> +
> + /* Do not busy poll if napi is disabled. */
> + if (unlikely(val & NAPIF_STATE_DISABLE))
> + busy_poll = false;
> +
> + if (busy_poll != busy_poll_sched)
> + assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL,
> + &napi->state, busy_poll);
>
> - while (!napi_thread_wait(napi))
> - napi_threaded_poll_loop(napi);
> + napi_threaded_poll_loop(napi, busy_poll);
> + }
>
> return 0;
> }
> @@ -12205,7 +12255,7 @@ static void run_backlog_napi(unsigned int cpu)
> {
> struct softnet_data *sd = per_cpu_ptr(&softnet_data, cpu);
>
> - napi_threaded_poll_loop(&sd->backlog);
> + napi_threaded_poll_loop(&sd->backlog, false);
> }
>
> static void backlog_napi_setup(unsigned int cpu)
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 2d9afc6e2161..36d0a22e341c 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -626,7 +626,7 @@ static int modify_napi_threaded(struct net_device *dev, unsigned long val)
> if (list_empty(&dev->napi_list))
> return -EOPNOTSUPP;
>
> - if (val != 0 && val != 1)
> + if (val > NAPI_THREADED_MAX)
> return -EOPNOTSUPP;
>
> ret = dev_set_threaded(dev, val);
> diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> index 93dc74dad6de..4086d2577dcc 100644
> --- a/net/core/netdev-genl-gen.c
> +++ b/net/core/netdev-genl-gen.c
> @@ -102,7 +102,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPE
> /* NETDEV_CMD_NAPI_SET_THREADED - do */
> static const struct nla_policy netdev_napi_set_threaded_nl_policy[NETDEV_A_NAPI_THREADED + 1] = {
> [NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
> - [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
> + [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 2),
> };
>
> /* Ops table for netdev */