Message-ID: <64eeeee0-8ec8-458b-b01a-ac1dd6c96583@huawei.com>
Date: Fri, 27 Sep 2024 11:58:47 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Joe Damato <jdamato@...tly.com>, <davem@...emloft.net>, <kuba@...nel.org>,
	<pabeni@...hat.com>, <liuyonglong@...wei.com>, <fanghaiqing@...wei.com>,
	<zhangkun09@...wei.com>, Alexander Lobakin <aleksander.lobakin@...el.com>,
	Jesper Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas
	<ilias.apalodimas@...aro.org>, Eric Dumazet <edumazet@...gle.com>,
	<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<mkarsten@...terloo.ca>
Subject: Re: [PATCH net v2 1/2] page_pool: fix timing for checking and
 disabling napi_local

On 2024/9/27 4:06, Joe Damato wrote:
> On Wed, Sep 25, 2024 at 03:57:06PM +0800, Yunsheng Lin wrote:
>> A page_pool page may be freed from skb_defer_free_flush() in
>> softirq context, which can cause a concurrent access problem
>> for the pool->alloc cache due to the time window shown below:
>> both CPU0 and CPU1 may access the pool->alloc cache
>> concurrently in page_pool_empty_alloc_cache_once() and
>> page_pool_recycle_in_cache():
>>
>>           CPU 0                           CPU1
>>     page_pool_destroy()          skb_defer_free_flush()
>>            .                               .
>>            .                   page_pool_put_unrefed_page()
>>            .                               .
>>            .               allow_direct = page_pool_napi_local()
>>            .                               .
>> page_pool_disable_direct_recycling()       .
>>            .                               .
>> page_pool_empty_alloc_cache_once() page_pool_recycle_in_cache()
>>
>> Use the RCU mechanism to avoid the above concurrent access problem.
>>
>> Note: the above was found during code review while working out how
>> to fix the problem in [1].
>>
>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>>
>> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
>> Signed-off-by: Yunsheng Lin <linyunsheng@...wei.com>
>> CC: Alexander Lobakin <aleksander.lobakin@...el.com>
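
For readers following along: the fix boils down to making the reader
side an RCU read-side critical section and inserting a grace period in
the destroy path. A minimal sketch of the idea (field and helper names
approximate the page_pool code; this is not a quote of the actual
patch):

	/* softirq / reader side: decide if direct recycling is safe */
	rcu_read_lock();
	napi = READ_ONCE(pool->p.napi);
	allow_direct = napi &&
		       READ_ONCE(napi->list_owner) == smp_processor_id();
	/* ... page_pool_recycle_in_cache() may touch pool->alloc ... */
	rcu_read_unlock();

	/* destroy side: disable direct recycling, then wait out readers */
	page_pool_disable_direct_recycling(pool);  /* clear pool->p.napi */
	synchronize_rcu();                 /* readers of old napi are done */
	page_pool_empty_alloc_cache_once(pool);  /* no concurrent writer now */

Since skb_defer_free_flush() runs in softirq context, which counts as
an RCU read-side critical section, synchronize_rcu() guarantees any
in-flight page_pool_napi_local() check has finished before the alloc
cache is emptied.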
> 
> Sorry for the noise, but I hit an assert in page_pool_unref_netmem
> and I am trying to figure out whether it is related to what you all
> are debugging. I thought it might be, but if not, my apologies.
> 
> Just in case it is, I've put the backtrace on github [1]. I
> triggered this while testing an RFC [2] I've been working on. Please
> note, the RFC posted publicly does not currently apply cleanly to
> net-next and has some bugs I've fixed in my v4. I had planned to
> send the v4 early next week and mention the page pool issue I am
> hitting.
> 
> After triggering the assert in [1], I tried applying the patches of
> this series and retesting the RFC v4 I have queued locally. When I
> did that, I hit a new assertion in page_pool_destroy [3].
> 
> There are a few possibilities:
>    1. I am hitting the same issue you are hitting
>    2. I am hitting a different issue caused by a bug I introduced
>    3. I am hitting a different page pool issue entirely
> 
> In case of 2 and 3, my apologies for the noise.
> 
> In case of 1: If you think I am hitting the same issue as you are
> trying to solve, I can reliably reproduce this with my RFC v4 and
> would be happy to test any patches meant to fix the issue.
> 
> [1]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning1.txt

It seems there may be some concurrent access to page->pp_ref_count/
page_pool->frag_users, or imbalanced incrementing and decrementing of
page->pp_ref_count.
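
To illustrate what "imbalanced" means here, below is a hypothetical
driver bug of the kind that would trip the assert in
page_pool_unref_netmem() (not a claim about your RFC, just the pattern
to look for):

	struct page *page = page_pool_dev_alloc_pages(pool);

	/* ... hand the page to the stack, which returns it once ... */
	page_pool_put_full_page(pool, page, false);

	/* a second put of the same page drops pp_ref_count below what
	 * the pool expects and fires the underflow check */
	page_pool_put_full_page(pool, page, false);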

For the concurrent access part, you may refer to the below patch
for debugging; but as far as I recall, the mlx5 driver doesn't use
the frag API directly, so you can't use that patch as-is.

https://lore.kernel.org/all/20240903122208.3379182-1-linyunsheng@huawei.com/
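
(By "frag API" I mean allocations like the below, where multiple
sub-page fragments share one page and its pp_ref_count; frag_size is
just an illustrative variable, this is not mlx5 code:)

	unsigned int offset;
	struct page *page;

	/* each successful call hands out a fragment of a shared page */
	page = page_pool_alloc_frag(pool, &offset, frag_size, GFP_ATOMIC);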

> [2]: https://lore.kernel.org/all/20240912100738.16567-1-jdamato@fastly.com/#r
> [3]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning2.txt

That warning is only added to see if there are any in-flight pages
that need DMA unmapping when page_pool_destroy() is called; you can
ignore the warning or remove the WARN_ONCE() in page_pool_item_uninit()
when testing.
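
For reference, the check in question is shaped roughly like this (a
sketch against the in-flight accounting the pool already keeps; the
actual patch may differ):

	/* pages handed out but never returned still hold a DMA
	 * mapping that can't be undone once the pool is gone */
	s32 inflight = pool->pages_state_hold_cnt -
		       atomic_read(&pool->pages_state_release_cnt);

	WARN_ONCE(pool->dma_map && inflight,
		  "%d inflight pages still need dma unmapping\n", inflight);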

