Message-ID: <2aa84c61-6531-4f17-89e5-101f46ef00d0@huawei.com>
Date: Wed, 22 Jan 2025 19:02:29 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>, Yunsheng
 Lin <yunshenglin0825@...il.com>, <davem@...emloft.net>, <kuba@...nel.org>,
	<pabeni@...hat.com>
CC: <zhangkun09@...wei.com>, <liuyonglong@...wei.com>,
	<fanghaiqing@...wei.com>, Alexander Lobakin <aleksander.lobakin@...el.com>,
	Xuan Zhuo <xuanzhuo@...ux.alibaba.com>, Jesper Dangaard Brouer
	<hawk@...nel.org>, Ilias Apalodimas <ilias.apalodimas@...aro.org>, Eric
 Dumazet <edumazet@...gle.com>, Simon Horman <horms@...nel.org>,
	<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and
 disabling napi_local

On 2025/1/20 19:24, Toke Høiland-Jørgensen wrote:

...

>>>>   
>>>>   void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>>>> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>>>>   	if (!page_pool_release(pool))
>>>>   		return;
>>>>   
>>>> +	/* Paired with rcu lock in page_pool_napi_local() to ensure the clearing
>>>> +	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
>>>> +	 * before returning to the driver to free the napi instance.
>>>> +	 */
>>>> +	synchronize_rcu();
>>>
>>> Most drivers call page_pool_destroy() in a loop for each RX queue, so
>>> now you're introducing a full synchronize_rcu() wait for each queue.
>>> That can delay tearing down the device significantly, so I don't think
>>> this is a good idea.
>>
>> synchronize_rcu() is called after page_pool_release(pool), which means
>> it is only called when there are some inflight pages, so there is not
>> necessarily a full synchronize_rcu() wait for each queue.
>>
>> Anyway, it seems that there are some cases that need an explicit
>> synchronize_rcu() and some cases that depend on another API providing
>> the synchronize_rcu() semantics, so maybe we should provide two different
>> APIs for both cases, like the netif_napi_del()/__netif_napi_del() APIs do?
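
(For illustration, a rough sketch of what such a split could look like; the
names and the common helper below are made up, just mirroring the
netif_napi_del()/__netif_napi_del() split:)

/* Made-up variant for callers that already rely on the synchronize_rcu()
 * semantics provided by another API (e.g. netif_napi_del()), so no rcu
 * sync is done here.
 */
void page_pool_destroy_nosync(struct page_pool *pool)
{
        __page_pool_destroy_common(pool);       /* hypothetical common helper */
}

/* Default variant: provides the rcu sync itself, as this patch does. */
void page_pool_destroy(struct page_pool *pool)
{
        __page_pool_destroy_common(pool);
        synchronize_rcu();
}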
> 
> I don't think so. This race can only be triggered if:
> 
> - An skb is allocated from a page_pool with a napi instance attached
> 
> - That skb is freed *in softirq context* while the memory backing the
>   NAPI instance is being freed.
> 
> It's only valid to free a napi instance after calling netif_napi_del(),
> which does a full synchronize_rcu(). This means that any running
> softirqs will have exited at this point, and all packets will have been
> flushed from the deferred freeing queues. And since the NAPI has been
> stopped at this point, no new packets can enter the deferred freeing
> queue from that NAPI instance.

Note that skb_defer_free_flush() can be called without being bound to
any NAPI instance, see the skb_defer_free_flush() called by net_rx_action(),
which means packets from that NAPI instance can still be freed in softirq
context even after the NAPI has been stopped.
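
Roughly, the path looks like the below (a paraphrased sketch of the code in
net/core/dev.c, not the exact upstream code); the deferred skbs are queued
and flushed per softnet_data, not per NAPI instance:

/* Paraphrased sketch: deferred skbs sit on the per-CPU softnet_data, so
 * net_rx_action() frees them no matter which NAPI instance their pages
 * originally came from.
 */
static void skb_defer_free_flush(struct softnet_data *sd)
{
        struct sk_buff *skb, *next;

        spin_lock(&sd->defer_lock);
        skb = sd->defer_list;
        sd->defer_list = NULL;
        sd->defer_count = 0;
        spin_unlock(&sd->defer_lock);

        while (skb) {
                next = skb->next;
                napi_consume_skb(skb, 1);  /* may hit page_pool napi-local recycling */
                skb = next;
        }
}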

> 
> So I really don't see a way for this race to happen with correct usage
> of the page_pool and NAPI APIs, which means there's no reason to make
> the change you are proposing here.

I looked at one driver setting pp->napi: the bnxt driver doesn't seem
to call page_pool_disable_direct_recycling() when unloading, see
bnxt_half_close_nic() below; page_pool_disable_direct_recycling() seems to
be called only for the new queue_mgmt API:

/* rtnl_lock held, this call can only be made after a previous successful
 * call to bnxt_half_open_nic().
 */
void bnxt_half_close_nic(struct bnxt *bp)
{
	bnxt_hwrm_resource_free(bp, false, true);
	bnxt_del_napi(bp);       /* <-- napi del and rcu sync */
	bnxt_free_skbs(bp);
	bnxt_free_mem(bp, true); /* <-- calls page_pool_destroy() */
	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
}

Even if page_pool_disable_direct_recycling() were called between
bnxt_del_napi() and bnxt_free_mem(), the timing window would still exist,
as the rcu sync needs to happen after page_pool_disable_direct_recycling().
It seems some refactoring is needed in the bnxt driver to reuse the rcu
sync from the NAPI API, in order to avoid calling the rcu sync in
page_pool_destroy().
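
Something like the reordering below might work (a rough sketch only;
bnxt_disable_pp_direct_recycling() is a made-up helper that would walk the
rx rings and call page_pool_disable_direct_recycling() on each ring's
page_pool):

void bnxt_half_close_nic(struct bnxt *bp)
{
        bnxt_hwrm_resource_free(bp, false, true);
        bnxt_disable_pp_direct_recycling(bp); /* clear pool->p.napi first */
        bnxt_del_napi(bp);                    /* rcu sync here now covers the clearing above */
        bnxt_free_skbs(bp);
        bnxt_free_mem(bp, true);              /* page_pool_destroy() would not need its own sync */
        clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
}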


> 
> -Toke
> 
> 
