netdev - Re: [PATCH RFC net-next/mm V3 1/2] page_pool: Remove workqueue in new shutdown scheme

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3785321f-b2f8-d753-7efc-78ee40e6d0b6@redhat.com>
Date: Thu, 4 May 2023 15:48:10 +0200
From: Jesper Dangaard Brouer <jbrouer@...hat.com>
To: Yunsheng Lin <linyunsheng@...wei.com>,
 Ilias Apalodimas <ilias.apalodimas@...aro.org>, netdev@...r.kernel.org,
 Eric Dumazet <eric.dumazet@...il.com>, linux-mm@...ck.org,
 Mel Gorman <mgorman@...hsingularity.net>
Cc: brouer@...hat.com, lorenzo@...nel.org,
 Toke Høiland-Jørgensen <toke@...hat.com>,
 bpf@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Andrew Morton <akpm@...ux-foundation.org>, willy@...radead.org
Subject: Re: [PATCH RFC net-next/mm V3 1/2] page_pool: Remove workqueue in new
 shutdown scheme


On 04/05/2023 04.42, Yunsheng Lin wrote:
> On 2023/4/29 0:16, Jesper Dangaard Brouer wrote:
>>   void page_pool_release_page(struct page_pool *pool, struct page *page)
>>   {
>> +	unsigned int flags = READ_ONCE(pool->p.flags);
>>   	dma_addr_t dma;
>> -	int count;
>> +	u32 release_cnt;
>> +	u32 hold_cnt;
>>   
>>   	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
>>   		/* Always account for inflight pages, even if we didn't
>> @@ -490,11 +503,15 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
>>   skip_dma_unmap:
>>   	page_pool_clear_pp_info(page);
>>   
>> -	/* This may be the last page returned, releasing the pool, so
>> -	 * it is not safe to reference pool afterwards.
>> -	 */
>> -	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
>> -	trace_page_pool_state_release(pool, page, count);
> 
> There is a time window between "unsigned int flags = READ_ONCE(pool->p.flags)"
> and flags checking, if page_pool_destroy() is called concurrently during that
> time window, it seems we will have a pp instance leaking problem here?
> 

Nope, that is resolved by the code changes in page_pool_destroy(), see 
below.

> It seems it is very hard to aovid this kind of corner case when using both
> flags & PP_FLAG_SHUTDOWN and release_cnt/hold_cnt checking to decide if pp
> instance can be freed.
> Can we use something like biased reference counting, which used by frag support
> in page pool? So that we only need to check only one variable and avoid cache
> bouncing as much as possible.
> 

See below, I believe we are doing an equivalent refcnt bias trick, that
solves these corner cases in page_pool_destroy().
In short: hold_cnt is increased, prior to setting PP_FLAG_SHUTDOWN.
Thus, if this code READ_ONCE flags without PP_FLAG_SHUTDOWN, we know it
will not be the last to release pool->pages_state_release_cnt.
Below: Perhaps, we should add a RCU grace period to make absolutely
sure, that this code completes before page_pool_destroy() call completes.


>> +	if (flags & PP_FLAG_SHUTDOWN)
>> +		hold_cnt = pp_read_hold_cnt(pool);
>> +

I would like to avoid above code, and I'm considering using call_rcu(),
which I think will resolve the race[0] this code deals with.
As I explained here[0], this code deals with another kind of race.

  [0] 
https://lore.kernel.org/all/f671f5da-d9bc-a559-2120-10c3491e6f6d@redhat.com/

>> +	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
>> +	trace_page_pool_state_release(pool, page, release_cnt);
>> +
>> +	/* In shutdown phase, last page will free pool instance */
>> +	if (flags & PP_FLAG_SHUTDOWN)
>> +		page_pool_free_attempt(pool, hold_cnt, release_cnt);
>>   }
>>   EXPORT_SYMBOL(page_pool_release_page);
>>
> 
> ...
> 
>>   
>>   void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
>> @@ -856,6 +884,10 @@ EXPORT_SYMBOL(page_pool_unlink_napi);
>>   
>>   void page_pool_destroy(struct page_pool *pool)
>>   {
>> +	unsigned int flags;
>> +	u32 release_cnt;
>> +	u32 hold_cnt;
>> +
>>   	if (!pool)
>>   		return;
>>   
>> @@ -868,11 +900,39 @@ void page_pool_destroy(struct page_pool *pool)
>>   	if (!page_pool_release(pool))
>>   		return;
>>   
>> -	pool->defer_start = jiffies;
>> -	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>> +	/* PP have pages inflight, thus cannot immediately release memory.
>> +	 * Enter into shutdown phase, depending on remaining in-flight PP
>> +	 * pages to trigger shutdown process (on concurrent CPUs) and last
>> +	 * page will free pool instance.
>> +	 *
>> +	 * There exist two race conditions here, we need to take into
>> +	 * account in the following code.
>> +	 *
>> +	 * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>> +	 *    pages into the ptr_ring.  Thus, it missed triggering shutdown
>> +	 *    process, which can then be stalled forever.
>> +	 *
>> +	 * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>> +	 *    page, which triggered shutdown process and freed pool
>> +	 *    instance. Thus, its not safe to dereference *pool afterwards.
>> +	 *
>> +	 * Handling races by holding a fake in-flight count, via
>> +	 * artificially bumping pages_state_hold_cnt, which assures pool
>> +	 * isn't freed under us.  For race(1) its safe to recheck ptr_ring
>> +	 * (it will not free pool). Race(2) cannot happen, and we can
>> +	 * release fake in-flight count as last step.
>> +	 */
>> +	hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>> +	smp_store_release(&pool->pages_state_hold_cnt, hold_cnt);
> 
> I assume the smp_store_release() is used to ensure the correct order
> between the above store operations?
> There is data dependency between those two store operations, do we
> really need the smp_store_release() here?
> 
>> +	barrier();
>> +	flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
> 
> Do we need a stronger barrier like smp_rmb() to prevent cpu from
> executing "flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN"
> before "smp_store_release(&pool->pages_state_hold_cnt, hold_cnt)"
> even if there is a smp_store_release() barrier here?
> 
I do see you point and how it is related to your above comment for
page_pool_release_page().

I think we need to replace barrier() with synchronize_rcu().
Meaning we add a RCU grace period to "wait" for above code (in
page_pool_release_page) that read the old flags value to complete.


>> +	smp_store_release(&pool->p.flags, flags);

When doing a synchronize_rcu(), I assume this smp_store_release() is
overkill, right?
Will a WRITE_ONCE() be sufficient?

Hmm, the synchronize_rcu(), shouldn't that be *after* storing the flags?

>> +
>> +	/* Concurrent CPUs could have returned last pages into ptr_ring */
>> +	page_pool_empty_ring(pool);
>>   
>> -	INIT_DELAYED_WORK(&pool->release_dw, page_pool_release_retry);
>> -	schedule_delayed_work(&pool->release_dw, DEFER_TIME);
>> +	release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
>> +	page_pool_free_attempt(pool, hold_cnt, release_cnt);
>>   }
>>   EXPORT_SYMBOL(page_pool_destroy);