Message-ID: <ZvW-uEXITmZtncub@LQ3V64L9R2>
Date: Thu, 26 Sep 2024 13:06:16 -0700
From: Joe Damato <jdamato@...tly.com>
To: Yunsheng Lin <linyunsheng@...wei.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
liuyonglong@...wei.com, fanghaiqing@...wei.com,
zhangkun09@...wei.com,
Alexander Lobakin <aleksander.lobakin@...el.com>,
Jesper Dangaard Brouer <hawk@...nel.org>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, mkarsten@...terloo.ca
Subject: Re: [PATCH net v2 1/2] page_pool: fix timing for checking and
disabling napi_local
On Wed, Sep 25, 2024 at 03:57:06PM +0800, Yunsheng Lin wrote:
> A page_pool page may be freed from skb_defer_free_flush() in
> softirq context, which can cause concurrent access to the
> pool->alloc cache due to the time window shown below: both
> CPU0 and CPU1 may access the pool->alloc cache concurrently
> in page_pool_empty_alloc_cache_once() and
> page_pool_recycle_in_cache():
>
> CPU 0                                  CPU 1
> page_pool_destroy()                    skb_defer_free_flush()
> .                                      .
> .                                      page_pool_put_unrefed_page()
> .                                      .
> .                                      allow_direct = page_pool_napi_local()
> .                                      .
> page_pool_disable_direct_recycling()   .
> .                                      .
> page_pool_empty_alloc_cache_once()     page_pool_recycle_in_cache()
>
> Use the RCU mechanism to avoid the above concurrent access problem.
>
> Note: the above was found during code review of how to fix
> the problem in [1].
>
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>
> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
> Signed-off-by: Yunsheng Lin <linyunsheng@...wei.com>
> CC: Alexander Lobakin <aleksander.lobakin@...el.com>
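If I'm reading the changelog right, the scheme is: the recycle path
performs the page_pool_napi_local() check and the recycle itself
inside one RCU read-side critical section, and page_pool_destroy()
calls synchronize_rcu() after disabling direct recycling, so no
reader can still be touching pool->alloc by the time it is emptied.
Below is a small compilable userspace model of that ordering, with
liburcu standing in for kernel RCU; the pool structure and the
function names are invented for the model, not taken from the
actual patch:

/* Build: gcc rcu_model.c -o rcu_model -lurcu -lpthread */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <urcu.h>		/* liburcu, default (memb) flavor */

#define CACHE_SIZE 8

/* Toy stand-in for struct page_pool: napi_local models pool->p.napi,
 * cache[]/count model the lockless pool->alloc cache. */
static struct {
	bool napi_local;
	int cache[CACHE_SIZE];
	int count;
} pool = { .napi_local = true };

/* Models the softirq recycle path: the napi-local check and the
 * cache insert sit inside one RCU read-side critical section. */
static void recycle_one(int page)
{
	rcu_read_lock();
	if (__atomic_load_n(&pool.napi_local, __ATOMIC_RELAXED) &&
	    pool.count < CACHE_SIZE)
		pool.cache[pool.count++] = page;
	rcu_read_unlock();
}

static void *reader(void *arg)
{
	rcu_register_thread();	/* readers must register with liburcu */
	for (int i = 0; i < 1000000; i++)
		recycle_one(i);
	rcu_unregister_thread();
	return NULL;
}

/* Models page_pool_destroy(): disable direct recycling first, then
 * wait out every reader before touching the cache. */
static void destroy_pool(void)
{
	__atomic_store_n(&pool.napi_local, false, __ATOMIC_RELAXED);
	synchronize_rcu();	/* no reader can still be mid-recycle */
	pool.count = 0;		/* now safe to empty the cache */
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, reader, NULL);
	destroy_pool();
	pthread_join(&t, NULL);
	printf("pool destroyed; cache count is %d\n", pool.count);
	return 0;
}

The property the model relies on is that once synchronize_rcu()
returns, every reader that could have observed the old napi-local
state has left its critical section, so emptying the cache afterwards
cannot race with a concurrent insert.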
Sorry for the noise, but I hit an assert in page_pool_unref_netmem
and am trying to figure out whether it is related to what you all
are debugging. I thought it might be, but if not, my apologies.
Just in case it is, I've put the backtrace on GitHub [1]. I
triggered it while testing an RFC [2] I've been working on. Please
note that the RFC posted publicly does not currently apply cleanly
to net-next and has some bugs I've fixed in my v4. I had planned to
send the v4 early next week and mention the page pool issue I am
hitting.
After triggering the assert in [1], I applied the patches in this
series and retested the RFC v4 I have queued locally. When I did
that, I hit a new assertion in page_pool_destroy [3].
There are a few possibilities:
1. I am hitting the same issue you are hitting
2. I am hitting a different issue caused by a bug I introduced
3. I am hitting a different page pool issue entirely
In cases 2 and 3, my apologies for the noise.
In case 1: if you think I am hitting the same issue you are trying
to solve, I can reliably reproduce it with my RFC v4 and would be
happy to test any patches meant to fix it.
[1]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning1.txt
[2]: https://lore.kernel.org/all/20240912100738.16567-1-jdamato@fastly.com/#r
[3]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning2.txt