netdev - Re: [net-next v3 PATCH] page_pool: handle page recycle for NUMA_NO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 18 Dec 2019 08:44:37 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     netdev@...r.kernel.org
Cc:     lirongqing@...du.com, linyunsheng@...wei.com,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Saeed Mahameed <saeedm@...lanox.com>, mhocko@...nel.org,
        peterz@...radead.org, linux-kernel@...r.kernel.org,
        brouer@...hat.com
Subject: Re: [net-next v3 PATCH] page_pool: handle page recycle for
 NUMA_NO_NODE condition

On Wed, 18 Dec 2019 00:17:36 +0100
Jesper Dangaard Brouer <brouer@...hat.com> wrote:

> The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
> not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.
> 
> The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
> recycle non-reusable pages"), were to have RX-pages that belongs to the
> same NUMA node as the CPU processing RX-packet during softirq/NAPI. As
> illustrated by the performance measurements.
> 
> This patch moves the NAPI checks out of fast-path, and at the same time
> solves the NUMA_NO_NODE issue.
> 
> First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
> will lookup current CPU nid (Numa ID) via numa_mem_id(), which is used
> as the the preferred nid.  It is only in rare situations, where
> e.g. NUMA zone runs dry, that page gets doesn't get allocated from
> preferred nid.  The page_pool API allows drivers to control the nid
> themselves via controlling pool->p.nid.
> 
> This patch moves the NAPI check to when alloc cache is refilled, via
> dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
> pages from remote NUMA into the ptr_ring, as the dequeue/consume step
> will check the NUMA node. All current drivers using page_pool will
> alloc/refill RX-ring from same CPU running softirq/NAPI process.
> 
> Drivers that control the nid explicitly, also use page_pool_update_nid
> when changing nid runtime.  To speed up transision to new nid the alloc
> cache is now flushed on nid changes.  This force pages to come from
> ptr_ring, which does the appropate nid check.
> 
> For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
> node, then ptr_ring will be emptied in 65 (PP_ALLOC_CACHE_REFILL+1)
> chunks per allocation and allocation fall-through to the real
> page-allocator with the new nid derived from numa_mem_id(). We accept
> that transitioning the alloc cache doesn't happen immediately.
> 
> Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
> Reported-by: Li RongQing <lirongqing@...du.com>
> Reported-by: Yunsheng Lin <linyunsheng@...wei.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@...hat.com>
> ---
>  net/core/page_pool.c |   64 ++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 49 insertions(+), 15 deletions(-)

I'm going to send a V4, because GCC doesn't generate the optimal ASM
code for the fast-path (details below).

> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a6aefe989043..37316ea66937 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -96,19 +96,22 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
>  }
>  EXPORT_SYMBOL(page_pool_create);
>  
> +static void __page_pool_return_page(struct page_pool *pool, struct page *page);
> +
>  /* fast path */
>  static struct page *__page_pool_get_cached(struct page_pool *pool)
>  {
>  	struct ptr_ring *r = &pool->ring;
> +	struct page *first_page, *page;
>  	bool refill = false;
> -	struct page *page;
> +	int i, curr_nid;
>  
>  	/* Test for safe-context, caller should provide this guarantee */
>  	if (likely(in_serving_softirq())) {
>  		if (likely(pool->alloc.count)) {
>  			/* Fast-path */
> -			page = pool->alloc.cache[--pool->alloc.count];
> -			return page;
> +			first_page = pool->alloc.cache[--pool->alloc.count];
> +			return first_page;
>  		}
>  		refill = true;
>  	}

The compiler (gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1)) doesn't
generate optimal ASM code for the likely fast-path of pool->alloc.cache
containing an element.  It does "isolate" it and return (retq) early in
this likely case, BUT it chooses to use %r15 (a call-preserved
register, aka callee-saved register).  Later other call-preserved
registers are used, which leads to pushing all the registers (%rbx,
%rbp, and %r12-r15) on the stack.

000000000000af0 <page_pool_alloc_pages>:
{
 af0:   e8 00 00 00 00          callq  af5 <page_pool_alloc_pages+0x5>
                        af1: R_X86_64_PLT32     __fentry__-0x4
 af5:   41 57                   push   %r15
 af7:   41 56                   push   %r14
 af9:   41 89 f6                mov    %esi,%r14d
 afc:   41 55                   push   %r13
 afe:   41 54                   push   %r12
 b00:   55                      push   %rbp
 b01:   48 89 fd                mov    %rdi,%rbp
 b04:   53                      push   %rbx
 b05:   48 83 ec 08             sub    $0x8,%rsp
 b09:   65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax        # b10 <page_pool_alloc_pages+
0x20>
                        b0c: R_X86_64_PC32      __preempt_count-0x4
        if (likely(in_serving_softirq())) {
 b10:   f6 c4 01                test   $0x1,%ah
 b13:   74 60                   je     b75 <page_pool_alloc_pages+0x85>
                if (likely(pool->alloc.count)) {
 b15:   8b 87 c0 00 00 00       mov    0xc0(%rdi),%eax
 b1b:   85 c0                   test   %eax,%eax
 b1d:   0f 84 94 01 00 00       je     cb7 <page_pool_alloc_pages+0x1c7>
                        first_page = pool->alloc.cache[--pool->alloc.count];
 b23:   83 e8 01                sub    $0x1,%eax
 b26:   89 87 c0 00 00 00       mov    %eax,0xc0(%rdi)
 b2c:   4c 8b bc c7 c8 00 00    mov    0xc8(%rdi,%rax,8),%r15
 b33:   00 
        if (page)
 b34:   4d 85 ff                test   %r15,%r15
 b37:   74 23                   je     b5c <page_pool_alloc_pages+0x6c>
}
 b39:   48 83 c4 08             add    $0x8,%rsp
 b3d:   4c 89 f8                mov    %r15,%rax
 b40:   5b                      pop    %rbx
 b41:   5d                      pop    %rbp
 b42:   41 5c                   pop    %r12
 b44:   41 5d                   pop    %r13
 b46:   41 5e                   pop    %r14
 b48:   41 5f                   pop    %r15
 b4a:   c3                      retq   
 [...]


> @@ -117,17 +120,42 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
>  	if (__ptr_ring_empty(r))
>  		return NULL;
>  
> -	/* Slow-path: Get page from locked ring queue,
> -	 * refill alloc array if requested.
> +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
>  	 */
> +	curr_nid = numa_mem_id();
> +
> +	/* Slower-path: Get pages from locked ring queue */
>  	spin_lock(&r->consumer_lock);
> -	page = __ptr_ring_consume(r);
> -	if (refill)
> -		pool->alloc.count = __ptr_ring_consume_batched(r,
> -							pool->alloc.cache,
> -							PP_ALLOC_CACHE_REFILL);
> +	first_page = __ptr_ring_consume(r);
> +
> +	/* Fallback to page-allocator if NUMA node doesn't match */
> +	if (first_page && unlikely(!(page_to_nid(first_page) == curr_nid))) {
> +		__page_pool_return_page(pool, first_page);
> +		first_page = NULL;
> +	}
> +
> +	if (unlikely(!refill))
> +		goto out;
> +
> +	/* Refill alloc array, but only if NUMA node match */
> +	for (i = 0; i < PP_ALLOC_CACHE_REFILL; i++) {
> +		page = __ptr_ring_consume(r);
> +		if (unlikely(!page))
> +			break;
> +
> +		if (likely(page_to_nid(page) == curr_nid)) {
> +			pool->alloc.cache[pool->alloc.count++] = page;
> +		} else {
> +			/* Release page to page-allocator, assume
> +			 * refcnt == 1 invariant of cached pages
> +			 */
> +			__page_pool_return_page(pool, page);
> +		}
> +	}
> +out:
>  	spin_unlock(&r->consumer_lock);
> -	return page;
> +	return first_page;
>  }

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer