linux-kernel - Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi


Message-ID: <c2b306af-4817-4169-814b-adbf25803919@huawei.com>
Date: Fri, 6 Dec 2024 20:29:40 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Jakub Kicinski <kuba@...nel.org>
CC: <davem@...emloft.net>, <pabeni@...hat.com>, <liuyonglong@...wei.com>,
	<fanghaiqing@...wei.com>, <zhangkun09@...wei.com>, Alexander Lobakin
	<aleksander.lobakin@...el.com>, Xuan Zhuo <xuanzhuo@...ux.alibaba.com>,
	Jesper Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas
	<ilias.apalodimas@...aro.org>, Eric Dumazet <edumazet@...gle.com>, Simon
 Horman <horms@...nel.org>, <netdev@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and
 disabling napi_local

On 2024/12/6 8:42, Jakub Kicinski wrote:
> On Thu, 5 Dec 2024 19:43:25 +0800 Yunsheng Lin wrote:
>> It depends on what the caller is trying to protect by calling
>> page_pool_disable_direct_recycling().
>>
>> It seems the use case for the only user of the API, in the bnxt driver,
>> is about reusing the same NAPI for different page_pool instances.
>>
>> According to the steps in netdev_rx_queue.c:
>> 1. allocate new queue memory & create page_pool
>> 2. stop old rx queue.
>> 3. start new rx queue with new page_pool
>> 4. free old queue memory + destroy page_pool.
>>
>> The page_pool_disable_direct_recycling() is called in step 2, I am
>> not sure how napi_enable() & napi_disable() are called in the above
>> flow, but it seems there is no use-after-free problem this patch is
>> trying to fix for the above flow.
>>
>> It doesn't seem to have any concurrent access problem if napi->list_owner
>> is set to -1 before napi_disable() returns and the napi_enable() for the
>> new queue is called after page_pool_disable_direct_recycling() is called
>> in step 2.
> 
> The fix is presupposing there is a long delay between fetching of
> the NAPI pointer and its access. The concern is that NAPI gets
> restarted in step 3 after we already READ_ONCE()'ed the pointer,
> then we access it and judge it to be running on the same core.
> Then we put the page into the fast cache which will never get
> flushed.

It seems napi_disable() is called before netdev_rx_queue_restart(), and
napi_enable() and ____napi_schedule() are called after
netdev_rx_queue_restart(), since no napi API is called in the bnxt
driver's implementation of 'netdev_queue_mgmt_ops'?

If so, napi->list_owner is set to -1 before step 1 and is only set to
a valid CPU in step 6, as below:
1. napi_disable()
2. allocate new queue memory & create new page_pool.
3. stop old rx queue.
4. start new rx queue with new page_pool.
5. free old queue memory + destroy old page_pool.
6. napi_enable() & ____napi_schedule()
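
The ordering above can be sketched as a small userspace model. This is only
an illustration of the claim, not kernel code: the step labels, NO_OWNER,
owner_after() and owned_before_enable() are hypothetical names, and the model
simply encodes the assumption that nothing but step 6 can hand the NAPI to a
CPU. Being sequential, it cannot show the absence of a race.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical labels for the six steps listed above. */
enum restart_step {
	STEP_NAPI_DISABLE = 1,	/* 1. napi_disable() */
	STEP_ALLOC_NEW,		/* 2. allocate new queue memory & page_pool */
	STEP_STOP_OLD,		/* 3. stop old rx queue */
	STEP_START_NEW,		/* 4. start new rx queue */
	STEP_FREE_OLD,		/* 5. free old queue memory & page_pool */
	STEP_NAPI_ENABLE,	/* 6. napi_enable() & ____napi_schedule() */
};

/*
 * Modelled value of napi->list_owner once `step` has completed, under the
 * assumption that only step 6 can assign an owner. `sched_cpu` is the
 * (non-negative) CPU the NAPI gets scheduled on in step 6.
 */
static int owner_after(enum restart_step step, int sched_cpu)
{
	assert(step >= STEP_NAPI_DISABLE && step <= STEP_NAPI_ENABLE);
	return step == STEP_NAPI_ENABLE ? sched_cpu : NO_OWNER;
}

/* True if any step before step 6 leaves the NAPI owned by `cpu` (>= 0). */
static bool owned_before_enable(int cpu)
{
	for (int s = STEP_NAPI_DISABLE; s < STEP_NAPI_ENABLE; s++)
		if (owner_after((enum restart_step)s, cpu) == cpu)
			return true;
	return false;
}
```

In this model owned_before_enable() is false for every valid CPU id, which is
the property the steps rely on; whether the real code paths honour the same
assumption is exactly the open question.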

And there are at least three flows involved here:
flow 1: calling napi_complete_done(), which sets napi->list_owner to -1.
flow 2: calling netdev_rx_queue_restart().
flow 3: calling skb_defer_free_flush() with a page belonging to the old
       page_pool.

The only case I can think of in which page_pool_napi_local() returns true
in flow 3 is when flow 1 and flow 3 run in softirq context on the same CPU
and flow 3 runs before flow 1.

It seems impossible for page_pool_napi_local() to return true between
step 1 and step 6: either flow 1 and flow 3 run in softirq context on the
same CPU, in which case flow 3 always sees the updated napi->list_owner, or
napi->list_owner != the CPU running flow 3. That seems to be an implicit
assumption for the case of napi scheduling between different CPUs too.
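
To make the condition concrete, here is a minimal userspace sketch of the
check, assuming page_pool_napi_local() has roughly this shape: not local
outside softirq, local if pool->cpuid matches, otherwise local only if
pool->p.napi is set and its list_owner is the current CPU. The struct and
function names are hypothetical stand-ins, and the current CPU and softirq
state are passed in as parameters where the kernel would query them itself
(and use READ_ONCE() for the loads).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fields the check reads. */
struct model_napi {
	int list_owner;			/* -1 when no CPU owns the NAPI */
};

struct model_pool {
	int cpuid;			/* models pool->cpuid, -1 if unset */
	struct model_napi *napi;	/* models pool->p.napi, may be NULL */
};

/* Simplified model of the locality check described above. */
static bool model_napi_local(const struct model_pool *pool,
			     int this_cpu, bool in_softirq)
{
	const struct model_napi *napi;

	if (!in_softirq)
		return false;
	if (pool->cpuid == this_cpu)
		return true;

	napi = pool->napi;
	return napi != NULL && napi->list_owner == this_cpu;
}

/* Models flow 1: completing the poll drops ownership. */
static void model_napi_complete(struct model_napi *napi)
{
	napi->list_owner = -1;
}
```

With list_owner == 2 and cpuid == -1, the check is true on CPU 2 in softirq
context only until model_napi_complete() has run, and false on any other CPU
throughout, which matches the "flow 3 before flow 1 on the same CPU" case.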

Also, the old page_pool is destroyed in step 5. I am not sure it is necessary
to call page_pool_disable_direct_recycling() in step 3 if page_pool_destroy()
in step 5 already has the synchronize_rcu() before napi is enabled.

If not, maybe I am missing something here. It would be good to be more specific
about the timing window in which page_pool_napi_local() returns true for the
old page_pool.
