Message-ID: <923d74d4-3d43-8cac-9732-c55103f6dafb@intel.com>
Date: Thu, 24 Aug 2023 17:26:21 +0200
From: Alexander Lobakin <aleksander.lobakin@...el.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
CC: Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	<netdev@...r.kernel.org>, Ratheesh Kannoth <rkannoth@...vell.com>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	"Geetha sowjanya" <gakula@...vell.com>,
	Ilias Apalodimas <ilias.apalodimas@...aro.org>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	Subbaraya Sundeep <sbhatta@...vell.com>,
	Sunil Goutham <sgoutham@...vell.com>,
	Thomas Gleixner <tglx@...utronix.de>, hariprasad <hkelam@...vell.com>,
	Qingfang DENG <qingfang.deng@...lower.com.cn>
Subject: Re: [BUG] Possible unsafe page_pool usage in octeontx2

From: Jesper Dangaard Brouer <hawk@...nel.org>
Date: Wed, 23 Aug 2023 21:45:04 +0200

> (Cc Olek as he has changes in this code path)

Thanks! I was reading the thread a bit on LKML, but being in the CC list
is more convenient :D

> 
> On 23/08/2023 11.47, Sebastian Andrzej Siewior wrote:
>> Hi,
>>
>> I've been looking at the page_pool locking.
>>
>> page_pool_alloc_frag() -> page_pool_alloc_pages() ->
>> __page_pool_get_cached():
>>
>> The core of the allocation is:
>> |         /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
>> |         if (likely(pool->alloc.count)) {
>> |                 /* Fast-path */
>> |                 page = pool->alloc.cache[--pool->alloc.count];
>>
>> The access to the `cache' array and the `count' variable is not locked.
>> This is fine as long as there is only one consumer per pool. In my
>> understanding the intention is to have one page_pool per NAPI callback
>> to ensure this.
>>
> 
> Yes, the intention is a single PP instance is "bound" to one RX-NAPI.

Isn't that also a misuse of page_pool->p.napi? I thought it could be set
only when page allocation and cache refill both happen inside the same
NAPI polling function. Otx2 uses workqueues to refill the queues,
meaning that the consumer and producer can run in different contexts or
even threads, so it shouldn't set p.napi.
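
To make the concern concrete, here is a rough userspace model in C of the
recycle-side check quoted further down in this thread. It is only a sketch,
not kernel code: model_napi, model_pool, model_allow_direct(),
model_get_cached() and the explicit cpu argument (standing in for
smp_processor_id()) are hypothetical names I made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CACHE_SIZE 8

/* Hypothetical stand-in for the NAPI instance; -1 means "not scheduled". */
struct model_napi {
	int list_owner;
};

/* Hypothetical stand-in for struct page_pool and its lockless cache. */
struct model_pool {
	struct model_napi *napi;        /* models pool->p.napi (opt-in) */
	unsigned int count;             /* models pool->alloc.count */
	void *cache[MODEL_CACHE_SIZE];  /* models pool->alloc.cache */
};

/*
 * Models the recycle-side check: direct recycling into the cache is
 * allowed only on the CPU that currently owns the NAPI instance.
 */
bool model_allow_direct(const struct model_pool *pool, bool napi_safe, int cpu)
{
	const struct model_napi *napi = pool->napi;

	return napi_safe && napi && napi->list_owner == cpu;
}

/*
 * Models the alloc fast path: note that there is no ownership check
 * here, so the caller alone must guarantee non-concurrent access.
 */
void *model_get_cached(struct model_pool *pool)
{
	assert(pool != NULL);
	if (pool->count)
		return pool->cache[--pool->count];
	return NULL;
}
```

In this model, a recycle on the owning CPU passes model_allow_direct() and
may push into the cache, while a refill worker on another CPU can pop from
the same cache through model_get_cached() with nothing serializing the two.
That is why I think p.napi only makes sense when both sides run in the same
NAPI poll.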

> 
> 
>> The pool can be filled in the same context (within allocation if the
>> pool is empty). There is also page_pool_recycle_in_cache() which fills
>> the pool from within skb free, for instance:
>>   napi_consume_skb() -> skb_release_all() -> skb_release_data() ->
>>   napi_frag_unref() -> page_pool_return_skb_page().
>>
>> The last one has the following check here:
>> |         napi = READ_ONCE(pp->p.napi);
>> |         allow_direct = napi_safe && napi &&
>> |                 READ_ONCE(napi->list_owner) == smp_processor_id();
>>
>> This eventually ends in page_pool_recycle_in_cache() where it adds the
>> page to the cache buffer if the check above is true (and BH is disabled).
>>
>> napi->list_owner is set once NAPI is scheduled until the poll callback
>> completes. It is safe to add items to the list because only one of the
>> two can run on a single CPU and their completion is ensured by having BH
>> disabled the whole time.
>>
>> This breaks in octeontx2 where a worker is used to fill the buffer:
>>    otx2_pool_refill_task() -> otx2_alloc_rbuf() -> __otx2_alloc_rbuf() ->
>>    otx2_alloc_pool_buf() -> page_pool_alloc_frag().
>>
> 
> This seems problematic! - this is NOT allowed.
> 
> But otx2_pool_refill_task() is a work-queue, and I thought it runs in
> process-context.  This WQ process is not allowed to use the lockless PP
> cache.  This seems to be a bug!
> 
> The problematic part is otx2_alloc_rbuf() that disables BH:
> 
>  int otx2_alloc_rbuf(struct otx2_nic *pfvf, struct otx2_pool *pool,
>             dma_addr_t *dma)
>  {
>     int ret;
> 
>     local_bh_disable();
>     ret = __otx2_alloc_rbuf(pfvf, pool, dma);
>     local_bh_enable();
>     return ret;
>  }
> 
> The fix could be to not do this local_bh_disable() in this driver?
> 
>> BH is disabled, but adding a page can still happen while the NAPI
>> callback runs on a remote CPU, thereby corrupting the index/array.
>>
>> API wise I would suggest to
>>
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 7ff80b80a6f9f..b50e219470a36 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -612,7 +612,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>               page_pool_dma_sync_for_device(pool, page,
>>                                 dma_sync_size);
>> -        if (allow_direct && in_softirq() &&
>> +        if (allow_direct && in_serving_softirq() &&
> 
> This is the "return/free/put" code path, where we have "allow_direct" as
> a protection in the API.  API users are supposed to use
> page_pool_recycle_direct() to indicate this, but at some point we
> allowed APIs to expose 'allow_direct'.
> 
> The PP-alloc side is more fragile, and maybe the in_serving_softirq()
> belongs there.
> 
>>               page_pool_recycle_in_cache(page, pool))
>>               return NULL;
>>
>> because the intention (as I understand it) is to be invoked from within
>> the NAPI callback (while softirq is served) and not if BH is just
>> disabled due to a lock or so.
>>
> 
> True, and it used-to-be like this (in_serving_softirq), but as Ilias
> wrote it was changed recently.  This was to support threaded-NAPI (in
> 542bcea4be866b ("net: page_pool: use in_softirq() instead")), which
> I understood was one of your (Sebastian's) use-cases.
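
For anyone following along, the difference between the two checks can be
modelled in a few lines of userspace C. This is a sketch under assumptions:
the constants are meant to mirror the non-PREEMPT_RT preempt_count layout
as I remember it from include/linux/preempt.h, and every model_* name is
hypothetical rather than a kernel symbol.

```c
#include <assert.h>
#include <stdbool.h>

/* Model constants; assumed to mirror the non-RT preempt_count layout. */
#define MODEL_SOFTIRQ_OFFSET         (1u << 8)
#define MODEL_SOFTIRQ_MASK           (0xffu << 8)
#define MODEL_SOFTIRQ_DISABLE_OFFSET (2u * MODEL_SOFTIRQ_OFFSET)

/* Stand-in for the per-task preempt counter. */
unsigned int model_preempt_count;

/* Disabling BH bumps the softirq field by two, leaving the low bit alone. */
void model_local_bh_disable(void)
{
	model_preempt_count += MODEL_SOFTIRQ_DISABLE_OFFSET;
}

void model_local_bh_enable(void)
{
	assert(model_preempt_count >= MODEL_SOFTIRQ_DISABLE_OFFSET);
	model_preempt_count -= MODEL_SOFTIRQ_DISABLE_OFFSET;
}

/* Actually running a softirq handler sets the low bit of the field. */
void model_softirq_enter(void)
{
	model_preempt_count += MODEL_SOFTIRQ_OFFSET;
}

void model_softirq_exit(void)
{
	assert(model_preempt_count >= MODEL_SOFTIRQ_OFFSET);
	model_preempt_count -= MODEL_SOFTIRQ_OFFSET;
}

/* True while serving a softirq OR while BH is merely disabled. */
bool model_in_softirq(void)
{
	return (model_preempt_count & MODEL_SOFTIRQ_MASK) != 0;
}

/* True only while a softirq handler is actually running. */
bool model_in_serving_softirq(void)
{
	return (model_preempt_count & MODEL_SOFTIRQ_OFFSET) != 0;
}
```

In the model, a worker that only calls model_local_bh_disable() satisfies
model_in_softirq() but not model_in_serving_softirq(), which matches the
otx2 situation: the in_softirq() test in the put path cannot tell such a
worker apart from a real NAPI poll.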
> 
> 
>> It would also make sense to add a WARN_ON_ONCE(!in_serving_softirq()) to
>> page_pool_alloc_pages() to spot usage outside of softirq. But this will
>> trigger in every driver since the same function is used in the open
>> callback to initially set up the HW.
>>
> 
> I'm very open to ideas for detecting this.  Since the mentioned commit,
> PP is open to these kinds of misuses of the API.
> 
> One idea would be to leverage that NAPI napi->list_owner will have been
> set to something other than -1 when this is NAPI context.  Getting hold
> of the napi object could be done via pp->p.napi (but as Jakub wrote this
> is opt-in ATM).
> 
> --Jesper
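
A list_owner-based check along those lines could look roughly like the
userspace model below. This is my reading of the idea, not an existing
kernel helper: model_napi_state, model_ctx and model_check_alloc_ctx() are
hypothetical, and cpu again stands in for smp_processor_id().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the NAPI instance; -1 means "not scheduled". */
struct model_napi_state {
	int list_owner;
};

enum model_ctx {
	MODEL_CTX_UNKNOWN,       /* no napi attached (opt-in), cannot tell */
	MODEL_CTX_NAPI_LOCAL,    /* NAPI is owned by this CPU: expected case */
	MODEL_CTX_NOT_SCHEDULED, /* e.g. initial setup in open, or a worker */
	MODEL_CTX_NAPI_REMOTE    /* NAPI polls on another CPU: likely misuse */
};

/* Classifies an allocation attempt made on 'cpu' against the NAPI owner. */
enum model_ctx model_check_alloc_ctx(const struct model_napi_state *napi,
				     int cpu)
{
	assert(cpu >= 0);

	if (napi == NULL)
		return MODEL_CTX_UNKNOWN;
	if (napi->list_owner == -1)
		return MODEL_CTX_NOT_SCHEDULED;
	if (napi->list_owner == cpu)
		return MODEL_CTX_NAPI_LOCAL;
	return MODEL_CTX_NAPI_REMOTE;
}
```

Only the "remote" case is unambiguous in this model. The "not scheduled"
case covers both the legitimate initial fill from the open callback and a
worker running while NAPI happens to be idle, so a warning there would
still need some way to tell the two apart.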

Thanks,
Olek