Message-ID: <87plkhn2x7.fsf@toke.dk>
Date: Mon, 20 Jan 2025 12:24:36 +0100
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Yunsheng Lin <yunshenglin0825@...il.com>, Yunsheng Lin
 <linyunsheng@...wei.com>, davem@...emloft.net, kuba@...nel.org,
 pabeni@...hat.com
Cc: zhangkun09@...wei.com, liuyonglong@...wei.com, fanghaiqing@...wei.com,
 Alexander Lobakin <aleksander.lobakin@...el.com>, Xuan Zhuo
 <xuanzhuo@...ux.alibaba.com>, Jesper Dangaard Brouer <hawk@...nel.org>,
 Ilias Apalodimas <ilias.apalodimas@...aro.org>, Eric Dumazet
 <edumazet@...gle.com>, Simon Horman <horms@...nel.org>,
 netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and
 disabling napi_local

Yunsheng Lin <yunshenglin0825@...il.com> writes:

> On 1/10/2025 11:40 PM, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <linyunsheng@...wei.com> writes:
>> 
>>> A page_pool page may be freed from skb_defer_free_flush() in
>>> softirq context without being bound to any specific napi, which
>>> may cause a use-after-free problem due to the time window shown
>>> below: CPU1 may still access napi->list_owner after CPU0 has
>>> freed the napi memory:
>>>
>>>              CPU 0                           CPU1
>>>        page_pool_destroy()          skb_defer_free_flush()
>>>               .                               .
>>>               .                napi = READ_ONCE(pool->p.napi);
>>>               .                               .
>>> page_pool_disable_direct_recycling()         .
>>>     driver free napi memory                   .
>>>               .                               .
>>>               .       napi && READ_ONCE(napi->list_owner) == cpuid
>>>               .                               .
>> 
>> Have you actually observed this happen, or are you just speculating?
>
> I did not actually observe it happen, but I added some delay and
> pr_err() debugging code in page_pool_napi_local()/page_pool_destroy(),
> and modified the page_pool test module in [1] to show that it is
> indeed possible to trigger if the delay between reading napi and
> checking napi->list_owner is long enough.
>
> 1. 
> https://patchwork.kernel.org/project/netdevbpf/patch/20240909091913.987826-1-linyunsheng@huawei.com/

Right, I wasn't contesting whether it's possible to trigger this race by
calling those two functions directly in some fashion. I was asking
whether there are any drivers that use the API in a way that this race
can happen; because I would consider any such driver buggy, and we
should fix this rather than adding more cruft to the page_pool API. See
below.

>> Because I don't think it can; deleting a NAPI instance already requires
>> observing an RCU grace period, cf netdevice.h:
>> 
>> /**
>>   *  __netif_napi_del - remove a NAPI context
>>   *  @napi: NAPI context
>>   *
>>   * Warning: caller must observe RCU grace period before freeing memory
>>   * containing @napi. Drivers might want to call this helper to combine
>>   * all the needed RCU grace periods into a single one.
>>   */
>> void __netif_napi_del(struct napi_struct *napi);
>> 
>> /**
>>   *  netif_napi_del - remove a NAPI context
>>   *  @napi: NAPI context
>>   *
>>   *  netif_napi_del() removes a NAPI context from the network device NAPI list
>>   */
>> static inline void netif_napi_del(struct napi_struct *napi)
>> {
>> 	__netif_napi_del(napi);
>> 	synchronize_net();
>> }
>
> I am not sure we can reliably depend on the implicit synchronize_net()
> above, since netif_napi_del() might not be called before
> page_pool_destroy(); for example, when changing rx_desc_num for a
> queue, which seems to be what hns3_set_ringparam() does in the hns3
> driver.

The hns3 driver doesn't use pp->napi at all AFAICT, so that's hardly
relevant.

>> 
>> 
>>> Use rcu mechanism to avoid the above problem.
>>>
>>> Note, the above was found during code reviewing on how to fix
>>> the problem in [1].
>>>
>>> As the following IOMMU fix patch depends on the synchronize_rcu()
>>> added in this patch, and the time window is so small that this
>>> does not seem to be an urgent fix, target net-next as the IOMMU
>>> fix patch does.
>>>
>>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>>>
>>> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
>>> Signed-off-by: Yunsheng Lin <linyunsheng@...wei.com>
>>> CC: Alexander Lobakin <aleksander.lobakin@...el.com>
>>> Reviewed-by: Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
>>> ---
>>>   net/core/page_pool.c | 15 ++++++++++++++-
>>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>> index 9733206d6406..1aa7b93bdcc8 100644
>>> --- a/net/core/page_pool.c
>>> +++ b/net/core/page_pool.c
>>> @@ -799,6 +799,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>>>   static bool page_pool_napi_local(const struct page_pool *pool)
>>>   {
>>>   	const struct napi_struct *napi;
>>> +	bool napi_local;
>>>   	u32 cpuid;
>>>   
>>>   	if (unlikely(!in_softirq()))
>>> @@ -814,9 +815,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
>>>   	if (READ_ONCE(pool->cpuid) == cpuid)
>>>   		return true;
>>>   
>>> +	/* Synchronized with page_pool_destroy() to avoid use-after-free
>>> +	 * for 'napi'.
>>> +	 */
>>> +	rcu_read_lock();
>>>   	napi = READ_ONCE(pool->p.napi);
>>> +	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
>>> +	rcu_read_unlock();
>> 
>> This rcu_read_lock/unlock() pair is redundant in the context you mention
>> above, since skb_defer_free_flush() is only ever called from softirq
>> context (within local_bh_disable()), which already functions as an RCU
>> read lock.
>
> I thought about it, but I am not sure whether we need an explicit rcu
> lock for the different kernel PREEMPT and RCU configs.
> Perhaps use rcu_read_lock_bh_held() to ensure that we are in the
> correct context?

page_pool_napi_local() returns immediately if in_softirq() returns
false. So the rcu_read_lock() is definitely not needed.
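
In kernel-C pseudocode, this point can be sketched as follows (condensed
from the function quoted above; not the actual source):

```c
/* Sketch: the early return means the rest of the function only ever
 * runs in softirq context, i.e. with BH disabled, which since the RCU
 * flavor consolidation already counts as an RCU read-side section. */
if (unlikely(!in_softirq()))
        return false;           /* never reached outside softirq */

/* BH is disabled here, giving the same read-side protection that an
 * explicit rcu_read_lock() would add */
napi = READ_ONCE(pool->p.napi);
return napi && READ_ONCE(napi->list_owner) == cpuid;
```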

>> 
>>> -	return napi && READ_ONCE(napi->list_owner) == cpuid;
>>> +	return napi_local;
>>>   }
>>>   
>>>   void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>>> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>>>   	if (!page_pool_release(pool))
>>>   		return;
>>>   
>>> +	/* Paired with the rcu lock in page_pool_napi_local() to ensure the
>>> +	 * clearing of pool->p.napi in page_pool_disable_direct_recycling()
>>> +	 * is seen before returning to the driver to free the napi instance.
>>> +	 */
>>> +	synchronize_rcu();
>> 
>> Most drivers call page_pool_destroy() in a loop for each RX queue, so
>> now you're introducing a full synchronize_rcu() wait for each queue.
>> That can delay tearing down the device significantly, so I don't think
>> this is a good idea.
>
> synchronize_rcu() is called after page_pool_release(pool), which means
> it is only called when there are some inflight pages, so there is not
> necessarily a full synchronize_rcu() wait for each queue.
>
> Anyway, it seems that some cases need an explicit synchronize_rcu()
> and others depend on another API providing synchronize_rcu() semantics;
> maybe we should provide two different APIs for the two cases, like
> the netif_napi_del()/__netif_napi_del() APIs do?

I don't think so. This race can only be triggered if:

- An skb is allocated from a page_pool with a napi instance attached

- That skb is freed *in softirq context* while the memory backing the
  NAPI instance is being freed.

It's only valid to free a napi instance after calling netif_napi_del(),
which does a full synchronize_rcu(). This means that any running
softirqs will have exited at this point, and all packets will have been
flushed from the deferred freeing queues. And since the NAPI has been
stopped at this point, no new packets can enter the deferred freeing
queue from that NAPI instance.
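
In kernel-C pseudocode, the safe teardown ordering described here looks
like this (the driver structure and function name are hypothetical; only
netif_napi_del() and page_pool_destroy() come from the thread):

```c
/* Hypothetical driver teardown; 'struct my_priv' is illustrative. */
static void my_driver_remove(struct my_priv *priv)
{
        netif_napi_del(&priv->napi);   /* includes a grace period: all
                                        * softirqs that could hold a stale
                                        * pool->p.napi have finished */
        page_pool_destroy(priv->pool); /* no reader can see the napi now */
        kfree(priv);                   /* napi memory freed only here */
}
```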

So I really don't see a way for this race to happen with correct usage
of the page_pool and NAPI APIs, which means there's no reason to make
the change you are proposing here.

-Toke

