Date:   Thu, 12 Dec 2019 09:04:47 +0800
From:   Yunsheng Lin <linyunsheng@...wei.com>
To:     Saeed Mahameed <saeedm@...lanox.com>,
        "brouer@...hat.com" <brouer@...hat.com>
CC:     "jonathan.lemon@...il.com" <jonathan.lemon@...il.com>,
        "ilias.apalodimas@...aro.org" <ilias.apalodimas@...aro.org>,
        Li Rongqing <lirongqing@...du.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE
 condition

On 2019/12/12 4:57, Saeed Mahameed wrote:
> On Wed, 2019-12-11 at 11:01 +0800, Yunsheng Lin wrote:
>> On 2019/12/11 3:45, Saeed Mahameed wrote:
>>>>> maybe assume that __page_pool_recycle_direct() is always called from
>>>>> the right node and change the current bogus check:
>>>>>
>>>>> from:
>>>>> page_to_nid(page) == pool->p.nid 
>>>>>
>>>>> to:
>>>>> page_to_nid(page) == numa_mem_id()
>>>>>
>>>>> This will allow recycling only if the handling node is the same as
>>>>> where the page was allocated, regardless of pool->p.nid.
>>>>>
>>>>> so semantics are:
>>>>>
>>>>> 1) allocate from: pool->p.nid, as chosen by user.
>>>>> 2) recycle when: page_to_nid(page) == numa_mem_id().
>>>>> 3) pool user must guarantee that the handler will run on the right
>>>>> node, which should always be the case; otherwise recycling will be
>>>>> skipped (no cross numa recycling).
>>>>>
>>>>>
>>>>> a) if the pool migrates, we will stop recycling until the pool moves
>>>>> back to the original node, or the user calls pool_update_nid() as we
>>>>> do in mlx5.
>>>>> b) if pool is NUMA_NO_NODE, then allocation and handling will be done
>>>>> on numa_mem_id(), which means the above check will work.
>>>>
>>>> Only checking page_to_nid(page) == numa_mem_id() may not work for the
>>>> below case in mvneta.c:
>>>>
>>>> static int mvneta_create_page_pool(struct mvneta_port *pp,
>>>> 				   struct mvneta_rx_queue *rxq, int size)
>>>> {
>>>> 	struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog);
>>>> 	struct page_pool_params pp_params = {
>>>> 		.order = 0,
>>>> 		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>> 		.pool_size = size,
>>>> 		.nid = cpu_to_node(0),
>>>> 		.dev = pp->dev->dev.parent,
>>>> 		.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE,
>>>> 		.offset = pp->rx_offset_correction,
>>>> 		.max_len = MVNETA_MAX_RX_BUF_SIZE,
>>>> 	};
>>>>
>>>> the pool->p.nid is not NUMA_NO_NODE, so the node of the page allocated
>>>> for rx may not be numa_mem_id() when running in the NAPI polling
>>>> context, because pool->p.nid is not the same as the node of the cpu
>>>> running the NAPI polling.
>>>>
>>>> Does the page pool support recycling for the above case?
>>>>
>>>
>>> I don't think you want to allow cross numa recycling.
>>
>> Cross numa recycling is not what I want.
>>
>>>> Or do we "fix" the above case by setting pool->p.nid to
>>>> NUMA_NO_NODE/dev_to_node(), or by calling pool_update_nid() in the
>>>> NAPI polling as mlx5 does?
>>>>
>>>
>>> Yes, just update_nid when needed, and make sure the NAPI polling runs
>>> on a consistent core; eventually alloc/recycling will happen on the
>>> same core.
>>
>> To me, passing NUMA_NO_NODE/dev_to_node() seems to always work.
>> Calling pool_update_nid() in NAPI polling is another way of passing
>> NUMA_NO_NODE to page_pool_init().
>>
>> And it seems to be a copy & paste problem in the mvneta and netsec
>> drivers, which use cpu_to_node(0) as pool->p.nid but do not call
>> page_pool_nid_changed() in the NAPI polling as mlx5 does.
>>
>> So I suggest removing page_pool_nid_changed() and always using
>> NUMA_NO_NODE/dev_to_node() as pool->p.nid, or making it clear (by a
>> comment or warning?) that page_pool_nid_changed() should be called
>> when pool->p.nid is NUMA_NO_NODE/dev_to_node().
>>
> 
> not an option.
> 
> rx rings should always allocate data buffers according to their cpu
> affinity and not dev_node or default to NUMA_NO_NODE.

Does "their cpu affinity" mean numa_mem_id() when running in the
NAPI polling context?

If yes, the node of data buffers allocated for rx will default to
numa_mem_id() when pool->p.nid is NUMA_NO_NODE.

See:

/*
 * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
 * prefer the current CPU's closest node. Otherwise node must be valid and
 * online.
 */
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
						unsigned int order)
{
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

	return __alloc_pages_node(nid, gfp_mask, order);
}

If the above is true, NUMA_NO_NODE defaults to numa_mem_id().
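
To make that concrete, here is a minimal sketch (illustrative only, not the
in-tree page_pool code; the helper name is made up) of the numa_mem_id()-based
recycle check discussed above. With pool->p.nid == NUMA_NO_NODE the allocation
side already falls back to numa_mem_id(), so this check holds as long as the
ring keeps polling on the node that allocated the page:

#include <linux/mm.h>		/* page_to_nid() */
#include <linux/topology.h>	/* numa_mem_id() */

/* Recycle only pages belonging to the node the current cpu runs on. */
static bool recycle_on_local_node(const struct page *page)
{
	return page_to_nid(page) == numa_mem_id();
}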

> 
>> I prefer to remove page_pool_nid_changed() if we do not allow
>> cross numa recycling.
>>
> 
> This is not for cross numa recycling.
> Since rx rings can migrate between cores (msix affinity/IRQ balancer),
> we need page_pool_nid_changed() for seamless migration and for
> recycling and allocation to migrate with the ring.


If the allocation and recycling of rx data buffers is guaranteed to happen
in the NAPI polling context, then when an rx ring migrates between nodes it
will stop recycling the old pages allocated from the old node and start
allocating, and then recycling, new pages from the new node, because
numa_mem_id() returns a different node id depending on which node the
current cpu is running on.

Or is the allocation and recycling of rx data buffers not guaranteed to be
in the same NAPI polling context for a specific ring? Is there a use case
for this?
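
For reference, a rough sketch of the mlx5-style nid update from the NAPI
polling context being discussed, assuming the page_pool_nid_changed() helper
from this series; the ring structure and poll function below are made-up
placeholders rather than actual driver code:

#include <linux/netdevice.h>	/* struct napi_struct, napi_complete_done() */
#include <linux/topology.h>	/* numa_mem_id() */
#include <net/page_pool.h>	/* page_pool_nid_changed() */

/* Made-up ring structure, just enough for the example. */
struct dummy_rx_ring {
	struct napi_struct napi;
	struct page_pool *page_pool;
};

static int dummy_napi_poll(struct napi_struct *napi, int budget)
{
	struct dummy_rx_ring *ring = container_of(napi, struct dummy_rx_ring,
						  napi);
	int work_done = 0;

	/*
	 * If the IRQ/ring has migrated to a cpu on another node, move the
	 * pool's preferred node with it, so that allocation (and hence
	 * recycling) follows the ring.
	 */
	page_pool_nid_changed(ring->page_pool, numa_mem_id());

	/* ... normal rx completion processing would go here ... */

	if (work_done < budget)
		napi_complete_done(napi, work_done);
	return work_done;
}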


> 
>>
>>>>> Thanks,
>>>>> Saeed.
