Message-ID: <ZvW-uEXITmZtncub@LQ3V64L9R2>
Date: Thu, 26 Sep 2024 13:06:16 -0700
From: Joe Damato <jdamato@...tly.com>
To: Yunsheng Lin <linyunsheng@...wei.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
liuyonglong@...wei.com, fanghaiqing@...wei.com,
zhangkun09@...wei.com,
Alexander Lobakin <aleksander.lobakin@...el.com>,
Jesper Dangaard Brouer <hawk@...nel.org>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, mkarsten@...terloo.ca
Subject: Re: [PATCH net v2 1/2] page_pool: fix timing for checking and
disabling napi_local
On Wed, Sep 25, 2024 at 03:57:06PM +0800, Yunsheng Lin wrote:
> A page_pool page may be freed from skb_defer_free_flush() in
> softirq context, which can cause concurrent access to the
> pool->alloc cache due to the time window shown below: both
> CPU0 and CPU1 may access the pool->alloc cache concurrently
> in page_pool_empty_alloc_cache_once() and
> page_pool_recycle_in_cache():
>
> CPU 0                                  CPU 1
> page_pool_destroy()                    skb_defer_free_flush()
> .                                      .
> .                                      page_pool_put_unrefed_page()
> .                                      .
> .                                      allow_direct = page_pool_napi_local()
> .                                      .
> page_pool_disable_direct_recycling()   .
> .                                      .
> page_pool_empty_alloc_cache_once()     page_pool_recycle_in_cache()
>
> Use the RCU mechanism to avoid the above concurrent access problem.
>
> Note: the above was found during code review of how to fix
> the problem in [1].
>
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>
> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
> Signed-off-by: Yunsheng Lin <linyunsheng@...wei.com>
> CC: Alexander Lobakin <aleksander.lobakin@...el.com>
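If I'm reading the changelog right, the scheme is: the recycle path
performs the page_pool_napi_local() check and the recycle itself
inside one RCU read-side critical section, and page_pool_destroy()
calls synchronize_rcu() after disabling direct recycling, so no
reader can still be touching pool->alloc by the time it is emptied.
Below is a small compilable userspace model of that ordering, with
liburcu standing in for kernel RCU; the pool structure and the
function names are invented for the model, not taken from the
actual patch:

/* Build: gcc rcu_model.c -o rcu_model -lurcu -lpthread */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <urcu.h>		/* liburcu, default (memb) flavor */

#define CACHE_SIZE 8

/* Toy stand-in for struct page_pool: napi_local models pool->p.napi,
 * cache[]/count model the lockless pool->alloc cache. */
static struct {
	bool napi_local;
	int cache[CACHE_SIZE];
	int count;
} pool = { .napi_local = true };

/* Models the softirq recycle path: the napi-local check and the
 * cache insert sit inside one RCU read-side critical section. */
static void recycle_one(int page)
{
	rcu_read_lock();
	if (__atomic_load_n(&pool.napi_local, __ATOMIC_RELAXED) &&
	    pool.count < CACHE_SIZE)
		pool.cache[pool.count++] = page;
	rcu_read_unlock();
}

static void *reader(void *arg)
{
	rcu_register_thread();	/* readers must register with liburcu */
	for (int i = 0; i < 1000000; i++)
		recycle_one(i);
	rcu_unregister_thread();
	return NULL;
}

/* Models page_pool_destroy(): disable direct recycling first, then
 * wait out every reader before touching the cache. */
static void destroy_pool(void)
{
	__atomic_store_n(&pool.napi_local, false, __ATOMIC_RELAXED);
	synchronize_rcu();	/* no reader can still be mid-recycle */
	pool.count = 0;		/* now safe to empty the cache */
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, reader, NULL);
	destroy_pool();
	pthread_join(&t, NULL);
	printf("pool destroyed; cache count is %d\n", pool.count);
	return 0;
}

The property the model relies on is that once synchronize_rcu()
returns, every reader that could have observed the old napi-local
state has left its critical section, so emptying the cache afterwards
cannot race with a concurrent insert.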
Sorry for the noise, but I hit an assert in page_pool_unref_netmem
and am trying to figure out whether it is related to what you all
are debugging. I thought it might be, but if not, my apologies.
Just in case it is, I've put the backtrace on GitHub [1]. I
triggered it while testing an RFC [2] I've been working on. Please
note that the RFC posted publicly does not currently apply cleanly
to net-next and has some bugs I've fixed in my v4. I had planned to
send the v4 early next week and mention the page pool issue I am
hitting.
After triggering the assert in [1], I applied the patches in this
series and retested the RFC v4 I have queued locally. When I did
that, I hit a new assertion in page_pool_destroy [3].
There are a few possibilities:
1. I am hitting the same issue you are hitting
2. I am hitting a different issue caused by a bug I introduced
3. I am hitting a different page pool issue entirely
In cases 2 and 3, my apologies for the noise.
In case 1: if you think I am hitting the same issue you are trying
to solve, I can reliably reproduce it with my RFC v4 and would be
happy to test any patches meant to fix it.
[1]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning1.txt
[2]: https://lore.kernel.org/all/20240912100738.16567-1-jdamato@fastly.com/#r
[3]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning2.txt