Message-ID: <c646d51c-4e91-bd86-9a5b-97b5a1ce33d0@redhat.com>
Date: Fri, 31 Mar 2023 11:31:31 +0200
From: Jesper Dangaard Brouer <jbrouer@...hat.com>
To: Jakub Kicinski <kuba@...nel.org>, davem@...emloft.net
Cc: brouer@...hat.com, netdev@...r.kernel.org, edumazet@...gle.com,
pabeni@...hat.com, hawk@...nel.org, ilias.apalodimas@...aro.org
Subject: Re: [RFC net-next 1/2] page_pool: allow caching from safely localized
NAPI
On 31/03/2023 06.39, Jakub Kicinski wrote:
> Recent patches to mlx5 mentioned a regression when moving from
> driver local page pool to only using the generic page pool code.
> Page pool has two recycling paths (1) direct one, which runs in
> safe NAPI context (basically consumer context, so producing
> can be lockless); and (2) via a ptr_ring, which takes a spin
> lock because the freeing can happen from any CPU; producer
> and consumer may run concurrently.
>
> Since the page pool code was added, Eric introduced a revised version
> of deferred skb freeing. TCP skbs are now usually returned to the CPU
> which allocated them, and freed in softirq context. This places the
> freeing (producing of pages back to the pool) enticingly close to
> the allocation (consumer).
>
> If we can prove that we're freeing in the same softirq context in which
> the consumer NAPI will run - lockless use of the cache is perfectly fine,
> no need for the lock.
Super interesting! :-)
>
> Let drivers link the page pool to a NAPI instance. If the NAPI instance
> is scheduled on the same CPU on which we're freeing - place the pages
> in the direct cache.
>
Cool, using the direct cache this way.

In the cases where we cannot use the direct cache, there is also the
option of bulk freeing into page_pool, which mitigates the ptr_ring
locking. See page_pool_put_page_bulk(). For XDP the
page_pool_put_page_bulk() API is used by xdp_return_frame_bulk() and
xdp_flush_frame_bulk(), which together build up a bulk.
(p.s. I see we could optimize xdp_return_frame_bulk() some more and
avoid some of the rhashtable_lookup() calls, but that is unrelated to
your patch).
> With that and patched bnxt (XDP enabled to engage the page pool, sigh,
> bnxt really needs page pool work :() I see a 2.6% perf boost with
> a TCP stream test (app on a different physical core than softirq).
>
> The CPU use of relevant functions decreases as expected:
>
> page_pool_refill_alloc_cache 1.17% -> 0%
> _raw_spin_lock 2.41% -> 0.98%
>
> Only consider lockless path to be safe when NAPI is scheduled
> - in practice this should cover majority if not all of steady state
> workloads. It's usually the NAPI kicking in that causes the skb flush.
>
Makes sense, but do read the comment above struct pp_alloc_cache.
The sizing of pp_alloc_cache is important for this trick/heuristic to
work, meaning pp_alloc_cache must have enough room.
It is definitely on purpose that pp_alloc_cache has 128 elements and is
only refilled with 64 elements, which leaves room for this kind of
trick. But if Eric's deferred skb freeing has more than 64 pages to
free, then we will likely fall back to ptr_ring recycling.
Code wise, I suggest that you/we change page_pool_put_page_bulk() to
have a variant that takes 'allow_direct' (looking at the code below,
you might already do this, as this patch overrides 'allow_direct').
Using the bulk API, we would then bulk into the ptr_ring in the cases
where we cannot use the direct cache.
> The main case we'll miss out on is when application runs on the same
> CPU as NAPI. In that case we don't use the deferred skb free path.
> We could disable softirq on that path, too... maybe?
>
> Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> ---
> CC: hawk@...nel.org
> CC: ilias.apalodimas@...aro.org
> ---
> include/linux/netdevice.h | 3 +++
> include/net/page_pool.h | 1 +
> net/core/dev.c | 3 +++
> net/core/page_pool.c | 16 ++++++++++++++++
> 4 files changed, 23 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 62e093a6d6d1..b3c11353078b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -360,8 +360,11 @@ struct napi_struct {
> unsigned long gro_bitmask;
> int (*poll)(struct napi_struct *, int);
> #ifdef CONFIG_NETPOLL
> + /* CPU actively polling if netpoll is configured */
> int poll_owner;
> #endif
> + /* CPU on which NAPI has been scheduled for processing */
> + int list_owner;
> struct net_device *dev;
> struct gro_list gro_hash[GRO_HASH_BUCKETS];
> struct sk_buff *skb;
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index ddfa0b328677..f86cdfb51585 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -77,6 +77,7 @@ struct page_pool_params {
> unsigned int pool_size;
> int nid; /* Numa node id to allocate from pages from */
> struct device *dev; /* device, for DMA pre-mapping purposes */
> + struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
> enum dma_data_direction dma_dir; /* DMA mapping direction */
> unsigned int max_len; /* max DMA sync memory size */
> unsigned int offset; /* DMA addr offset */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0c4b21291348..a6d6e5c89ce7 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4360,6 +4360,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> }
>
> list_add_tail(&napi->poll_list, &sd->poll_list);
> + WRITE_ONCE(napi->list_owner, smp_processor_id());
> /* If not called from net_rx_action()
> * we have to raise NET_RX_SOFTIRQ.
> */
> @@ -6070,6 +6071,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> list_del_init(&n->poll_list);
> local_irq_restore(flags);
> }
> + WRITE_ONCE(n->list_owner, -1);
>
> val = READ_ONCE(n->state);
> do {
> @@ -6385,6 +6387,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
> #ifdef CONFIG_NETPOLL
> napi->poll_owner = -1;
> #endif
> + napi->list_owner = -1;
> set_bit(NAPI_STATE_SCHED, &napi->state);
> set_bit(NAPI_STATE_NPSVC, &napi->state);
> list_add_rcu(&napi->dev_list, &dev->napi_list);
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 193c18799865..c3e2ab0c2684 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -19,6 +19,7 @@
> #include <linux/mm.h> /* for put_page() */
> #include <linux/poison.h>
> #include <linux/ethtool.h>
> +#include <linux/netdevice.h>
>
> #include <trace/events/page_pool.h>
>
> @@ -544,6 +545,18 @@ static bool page_pool_recycle_in_cache(struct page *page,
> return true;
> }
>
> +/* If caller didn't allow direct recycling check if we have other reasons
> + * to believe that the producer and consumer can't race.
> + *
> + * Result is only meaningful in softirq context.
> + */
> +static bool page_pool_safe_producer(struct page_pool *pool)
> +{
> + struct napi_struct *napi = pool->p.napi;
> +
> + return napi && READ_ONCE(napi->list_owner) == smp_processor_id();
> +}
> +
> /* If the page refcnt == 1, this will try to recycle the page.
> * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
> * the configured size min(dma_sync_size, pool->max_len).
> @@ -570,6 +583,9 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> page_pool_dma_sync_for_device(pool, page,
> dma_sync_size);
>
> + if (!allow_direct)
> + allow_direct = page_pool_safe_producer(pool);
> +
I remember a use case for veth that explicitly disables
"allow_direct". I cannot remember why exactly, but we need to make sure
this doesn't break something (as this code can override allow_direct).
> if (allow_direct && in_softirq() &&
> page_pool_recycle_in_cache(page, pool))
> return NULL;