Message-ID: <72f388e8-adef-4ecc-812d-a1d19c801df5@gmail.com>
Date: Tue, 4 Feb 2025 21:51:18 +0800
From: Yunsheng Lin <yunshenglin0825@...il.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>,
Yunsheng Lin <linyunsheng@...wei.com>, davem@...emloft.net, kuba@...nel.org,
pabeni@...hat.com
Cc: zhangkun09@...wei.com, liuyonglong@...wei.com, fanghaiqing@...wei.com,
Alexander Lobakin <aleksander.lobakin@...el.com>,
Xuan Zhuo <xuanzhuo@...ux.alibaba.com>,
Jesper Dangaard Brouer <hawk@...nel.org>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Eric Dumazet <edumazet@...gle.com>, Simon Horman <horms@...nel.org>,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and
disabling napi_local
On 1/27/2025 9:47 PM, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <yunshenglin0825@...il.com> writes:
>
>> On 1/25/2025 1:13 AM, Toke Høiland-Jørgensen wrote:
>>> Yunsheng Lin <linyunsheng@...wei.com> writes:
>>>
>>>>> So I really don't see a way for this race to happen with correct usage
>>>>> of the page_pool and NAPI APIs, which means there's no reason to make
>>>>> the change you are proposing here.
>>>>
>>>> I looked at one driver setting pp->napi, and it seems the bnxt driver
>>>> doesn't seem to call page_pool_disable_direct_recycling() when unloading,
>>>> see bnxt_half_close_nic(); page_pool_disable_direct_recycling() seems to
>>>> be only called for the new queue_mgmt API:
>>>>
>>>> /* rtnl_lock held, this call can only be made after a previous successful
>>>>  * call to bnxt_half_open_nic().
>>>>  */
>>>> void bnxt_half_close_nic(struct bnxt *bp)
>>>> {
>>>>         bnxt_hwrm_resource_free(bp, false, true);
>>>>         bnxt_del_napi(bp);        /* ----call napi del and rcu sync---- */
>>>>         bnxt_free_skbs(bp);
>>>>         bnxt_free_mem(bp, true);  /* ------call page_pool_destroy()---- */
>>>>         clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
>>>> }
>>>>
>>>> Even if there is a page_pool_disable_direct_recycling() called between
>>>> bnxt_del_napi() and bnxt_free_mem(), the timing window still exists, as
>>>> the rcu sync needs to be called after page_pool_disable_direct_recycling();
>>>> it seems some refactoring is needed in the bnxt driver to reuse the rcu
>>>> sync from the NAPI API, in order to avoid calling the rcu sync for
>>>> page_pool_destroy().
>>>
>>> Well, I would consider that usage buggy. A page pool object is created
>>> with a reference to the napi struct; so the page pool should also be
>>> destroyed (clearing its reference) before the napi memory is freed. I
>>> guess this is not really documented anywhere, but it's pretty standard
>>> practice to free objects in the opposite order of their creation.
>>
>> I am not so familiar with the rules around the NAPI creation API, but the
>> bnxt driver implementation can hold a reference to 'struct napi' before
>> calling netif_napi_add(), see below:
>>
>> static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init,
>>                            bool link_re_init)
>> {
>> .......
>>         rc = bnxt_alloc_mem(bp, irq_re_init);        /* create page_pool */
>>         if (rc) {
>>                 netdev_err(bp->dev, "bnxt_alloc_mem err: %x\n", rc);
>>                 goto open_err_free_mem;
>>         }
>>
>>         if (irq_re_init) {
>>                 bnxt_init_napi(bp);                  /* netif_napi_add */
>>                 rc = bnxt_request_irq(bp);
>>                 if (rc) {
>>                         netdev_err(bp->dev, "bnxt_request_irq err: %x\n", rc);
>>                         goto open_err_irq;
>>                 }
>>         }
>>
>> .....
>> }
>
> Regardless of the initialisation error, the fact that bnxt frees the
> NAPI memory before calling page_pool_destroy() is a driver bug. Mina has
> a suggestion for a warning to catch such bugs over in this thread:
>
> https://lore.kernel.org/r/CAHS8izOv=tUiuzha6NFq1-ZurLGz9Jdi78jb3ey4ExVJirMprA@mail.gmail.com
Thanks for the reminder.
As the main problem is about adding an rcu sync between
page_pool_disable_direct_recycling() and page_pool_destroy(), I am
really doubtful that a warning can be added to catch such bugs as long
as page_pool_destroy() does not use an explicit rcu sync and instead
relies on the rcu sync from the napi del API.
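
To make the window I am talking about more concrete, an explicit sync on
the page_pool side would have to sit roughly as below. This is an
illustrative sketch only, not code from this series, and
page_pool_destroy_sketch() is a made-up name:

/* Illustrative sketch only, not the actual page_pool code. It shows
 * where an explicit rcu sync would have to go if page_pool_destroy()
 * could not rely on the rcu sync done by the napi del API.
 */
static void page_pool_destroy_sketch(struct page_pool *pool)
{
        /* Stop NAPI-context direct recycling first. */
        page_pool_disable_direct_recycling(pool);

        /* Wait out any softirq that may have observed pool->p.napi
         * before it was cleared above; without this (or an equivalent
         * rcu sync elsewhere) the check/disable timing window stays
         * open.
         */
        synchronize_rcu();

        /* ... the normal page_pool_destroy() path would follow ... */
}

And this per-queue synchronize_rcu() is exactly the cost being objected
to below.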
>
>>> So no, I don't think this is something that should be fixed on the page
>>> pool side (and certainly not by adding another synchronize_rcu() call
>>> per queue!); rather, we should fix the drivers that get this wrong (and
>>> probably document the requirement a bit better).
>>
>> Even if the timing problem of checking and disabling napi_local should
>> not be fixed on the page_pool side, do we have a common understanding
>> about fixing the DMA API misuse problem on the page_pool side?
>> If yes, do we have a common understanding that some mechanism like
>> synchronize_rcu() might still be needed on the page_pool side?
>
> I have not reviewed the rest of your patch set, I only looked at this
> patch. I see you posted v8 without addressing Jesper's ask for a
> conceptual description of your design. I am not going to review a
> 600-something line patch series without such a description to go by, so
> please address that first.
I thought Jesper's ask was mainly about why the page->pp pointer is
being hijacked.

I summarized the discussion in [1] as below; please let me know if that
addresses your concern too.
"By using the 'struct page_pool_item' referenced by page->pp_item,
page_pool is not only able to keep track of the inflight page to do dma
unmmaping when page_pool_destroy() is called if some pages are still
handled in networking stack, and networking stack is also able to find
the page_pool owning the page when returning pages back into page_pool.
struct page_pool_item {
        unsigned long state;

        union {
                netmem_ref pp_netmem;
                struct llist_node lentry;
        };
};
When a page is added to the page_pool, an item is taken from
pool->hold_items, its 'pp_netmem' is set to point to that page and its
'state' is set accordingly in order to keep track of that page;
pool->hold_items is refilled from pool->release_items when it is empty,
and an item from pool->slow_items is used when the fast items run out.
When a page is released from the page_pool, it is able to tell which
page_pool this page belongs to by using the below functions:
static inline struct page_pool_item_block *
page_pool_item_to_block(struct page_pool_item *item)
{
        return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
}
static inline struct page_pool *page_pool_get_pp(struct page *page)
{
        /* The size of item_block is always PAGE_SIZE, the address of item_block
         * for a specific item can be calculated using 'item & PAGE_MASK', so
         * that we can find the page_pool object it belongs to.
         */
        return page_pool_item_to_block(page->pp_item)->pp;
}
After clearing 'pp_item->state', the item for the released page is
added back to pool->release_items so that it can be reused for new
pages, or it is simply freed when it came from pool->slow_items.
When page_pool_destroy() is called, pp_item->state is used to tell
whether a specific item is still in use/DMA mapped by scanning all the
item blocks in pool->item_blocks, and pp_item->pp_netmem can then be
used to do the DMA unmapping if the corresponding inflight page is DMA
mapped."
[1] https://lore.kernel.org/all/2b5a58f3-d67a-4bf7-921a-033326958ac6@huawei.com/
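
To make the lifecycle above a bit more concrete, the destroy-time scan
could look roughly like the below. This is a simplified sketch rather
than the actual code from the series: the item block fields (list,
nr_items, items), the ITEM_STATE_DMA_MAPPED bit and the
page_pool_item_dma_unmap() helper are made-up names used only for
illustration:

/* Simplified sketch, not the code from the series: walk every item
 * block in pool->item_blocks and do the DMA unmapping for the pages
 * whose items are still marked as in use and DMA mapped.
 */
static void page_pool_item_unmap_inflight(struct page_pool *pool)
{
        struct page_pool_item_block *block;
        unsigned int i;

        list_for_each_entry(block, &pool->item_blocks, list) {
                for (i = 0; i < block->nr_items; i++) {
                        struct page_pool_item *item = &block->items[i];

                        /* Skip items that are free or whose page was
                         * never DMA mapped by this pool.
                         */
                        if (!test_bit(ITEM_STATE_DMA_MAPPED, &item->state))
                                continue;

                        page_pool_item_dma_unmap(pool, item->pp_netmem);
                }
        }
}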
>
>> If yes, it may be better to focus on discussing how to avoid calling rcu
>> sync for each queue mentioned in [1].
>
> Regardless of whether a synchronize_rcu() is needed in the final design
> (and again, note that I don't have an opinion on this before reviewing
> the whole series), this patch should be dropped from the series. The bug
> it is purporting to fix is a driver API misuse and should be fixed in
> the drivers, cf the above.
I am still a little doubtful that it is a driver API misuse problem, as
I am not sure whether page_pool_destroy() can depend on the rcu sync
from the napi del API in all cases. Even if it is, this driver API
misuse problem seems to only exist after the page_pool NAPI recycling
feature/API was added, which might mean some refactoring is needed on
the driver side to support page_pool NAPI recycling.

Anyway, it seems to make sense to drop this patch from the series for
better forward progress on the DMA misuse problem, as the two are not
really related.
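
For reference, if it does end up being fixed purely on the driver side,
the teardown would need to be reordered roughly as below, so that
page_pool reuses the rcu sync already done by the napi del API. This is
an illustrative ordering only, not an actual bnxt patch, and
rxr->page_pool / bnapi->napi are used loosely for a single rx queue:

/* Illustrative teardown ordering only, not an actual bnxt patch. */
page_pool_disable_direct_recycling(rxr->page_pool);

/* netif_napi_del() ends with synchronize_net(), so the rcu sync that
 * the napi_local check depends on happens here, after direct recycling
 * has already been disabled above.
 */
netif_napi_del(&bnapi->napi);

page_pool_destroy(rxr->page_pool);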
>
> -Toke
>