Message-ID: <27475b57-eda1-4d67-93f2-5ca443632f6b@huawei.com>
Date: Thu, 21 Nov 2024 16:03:22 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>, <davem@...emloft.net>,
	<kuba@...nel.org>, <pabeni@...hat.com>
CC: <liuyonglong@...wei.com>, <fanghaiqing@...wei.com>,
	<zhangkun09@...wei.com>, Robin Murphy <robin.murphy@....com>, Alexander Duyck
	<alexander.duyck@...il.com>, IOMMU <iommu@...ts.linux.dev>, Ilias Apalodimas
	<ilias.apalodimas@...aro.org>, Eric Dumazet <edumazet@...gle.com>, Simon
 Horman <horms@...nel.org>, <netdev@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has
 already unbound

On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:

...

>>         /* No more consumers should exist, but producers could still
>>        * be in-flight.
>> @@ -1119,6 +1134,58 @@ void page_pool_disable_direct_recycling(struct page_pool *pool)
>>   }
>>   EXPORT_SYMBOL(page_pool_disable_direct_recycling);
>>   +static void page_pool_inflight_unmap(struct page_pool *pool)
>> +{
>> +    unsigned int unmapped = 0;
>> +    struct zone *zone;
>> +    int inflight;
>> +
>> +    if (!pool->dma_map || pool->mp_priv)
>> +        return;
>> +
>> +    get_online_mems();
>> +    spin_lock_bh(&pool->destroy_lock);
>> +
>> +    inflight = page_pool_inflight(pool, false);
>> +    for_each_populated_zone(zone) {
>> +        unsigned long end_pfn = zone_end_pfn(zone);
>> +        unsigned long pfn;
>> +
>> +        for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
>> +            struct page *page = pfn_to_online_page(pfn);
>> +
>> +            if (!page || !page_count(page) ||
>> +                (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
>> +                page->pp != pool)
>> +                continue;
>> +
>> +            dma_unmap_page_attrs(pool->p.dev,
>> +                         page_pool_get_dma_addr(page),
>> +                         PAGE_SIZE << pool->p.order,
>> +                         pool->p.dma_dir,
>> +                         DMA_ATTR_SKIP_CPU_SYNC |
>> +                         DMA_ATTR_WEAK_ORDERING);
>> +            page_pool_set_dma_addr(page, 0);
>> +
> 
> I feel this belongs in a helper function call.
> 
> Previously we had a function called: page_pool_release_page().

Previously I was going to reuse __page_pool_release_page_dma() for the above,
but __page_pool_release_page_dma() would repeat the pool->dma_map check for
every page, while page_pool_inflight_unmap() already checks that once before
the scan starts.
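
A minimal sketch of such a helper (the name __page_pool_unmap_page() is
made up here), which is just the unmap logic from the loop above with the
pool->dma_map check hoisted out to the caller:

	/* Hypothetical helper: unmap a single in-flight page. The caller
	 * (page_pool_inflight_unmap()) has already checked pool->dma_map,
	 * so no per-page recheck is needed.
	 */
	static void __page_pool_unmap_page(struct page_pool *pool,
					   struct page *page)
	{
		dma_unmap_page_attrs(pool->p.dev,
				     page_pool_get_dma_addr(page),
				     PAGE_SIZE << pool->p.order,
				     pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC |
				     DMA_ATTR_WEAK_ORDERING);
		page_pool_set_dma_addr(page, 0);
	}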

> 
> This was used to convert the page into a normal page again, releasing
> page_pool dependencies.  It was used when packet transitioned into the
> netstack, but it was removed when we decided it was safe to let netstack
> return pp pages.
> 
> Removed in commits:
>  - 535b9c61bdef ("net: page_pool: hide page_pool_release_page()")
>  - 07e0c7d3179d ("net: page_pool: merge page_pool_release_page() with page_pool_return_page()")
> 
> 
>> +            unmapped++;
>> +
>> +            /* Skip scanning all pages when debug is disabled */
>> +            if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
>> +                inflight == unmapped)
>> +                goto out;
>> +        }
>> +    }
>> +
>> +out:
>> +    WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
>> +          "page_pool(%u): unmapped(%u) != inflight pages(%d)\n",
>> +          pool->user.id, unmapped, inflight);
>> +
>> +    pool->dma_map = false;
>> +    spin_unlock_bh(&pool->destroy_lock);
>> +    put_online_mems();
>> +}
>> +
>>   void page_pool_destroy(struct page_pool *pool)
>>   {
>>       if (!pool)
>> @@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
>>        */
>>       synchronize_rcu();
>>   +    page_pool_inflight_unmap(pool);
>> +
> 
> Reaching here means we have detected in-flight packets/pages.
> 
> In "page_pool_inflight_unmap" we scan and find those in-flight pages to
> DMA unmap them. Then below we wait for these in-flight pages again.
> Why don't we just "release" (page_pool_release_page) those in-flight
> pages from belonging to the page_pool, when we found them during scanning?
> 
> If doing so, we can hopefully remove the periodic checking code below.

I thought about that too, but it means more complicated work than just
calling page_pool_release_page(), as page->pp_ref_count would need to be
converted into page->_refcount for the above to work. It seems hard to do
that without performance degradation, given the racing against
page_pool_put_page() being called concurrently.
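
A rough sketch of the problematic interleaving (illustration only, not an
exact trace):

	/*
	 *  CPU0 (destroy path, scanning)    CPU1 (datapath)
	 *
	 *  ref = read page->pp_ref_count
	 *                                    page_pool_put_page(pool, page)
	 *                                      -> decrements pp_ref_count,
	 *                                         may recycle or free page
	 *  page_ref_add(page, ref)
	 *  clear page->pp_magic / page->pp
	 *
	 * CPU0 transfers a stale pp_ref_count into _refcount, leaving the
	 * page with the wrong number of references.
	 */

Closing that window would seem to need some atomic handover on
pp_ref_count, which is exactly the kind of fast-path overhead being
avoided here.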

> 
>>       page_pool_detached(pool);
>>       pool->defer_start = jiffies;
>>       pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>>       /* Flush pool alloc cache, as refill will check NUMA node */
>>       while (pool->alloc.count) {
>>           netmem = pool->alloc.cache[--pool->alloc.count];
>> -        page_pool_return_page(pool, netmem);
>> +        __page_pool_return_page(pool, netmem);
>>       }
>>   }
>>   EXPORT_SYMBOL(page_pool_update_nid);
> 
> Thanks for continuing to work on this :-)

I am not sure how scalable the scanning is going to be as the memory size
gets bigger, which is one of the reasons I am posting this version as an RFC.

From some quick searching, it seems there are servers with a max RAM capacity
of 12.3TB, which means the scanning might take up to about 10 secs on those
systems. The spin_lock is used to avoid concurrency, as the page_pool_put_page()
API might be called from softirq context, which might mean spinning for up to
~10 secs in softirq context.
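
Back-of-envelope for that estimate (my own rough numbers): 12.3TB of 4KB
pages is about 3.3 billion pfns, so even at only ~3ns per
pfn_to_online_page() plus the pp_magic/page->pp checks, a full scan comes
to roughly 10 secs.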

And it seems hard to call cond_resched() when the scanning and unmapping take
a long time, as page_pool_put_page() might be called concurrently whenever
pool->destroy_lock is released, which means page_pool_get_dma_addr() would
need to be checked for each page to decide whether it has already been
unmapped.
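
A rough sketch of what such a chunked scan might look like (untested;
SCAN_BATCH is an arbitrary made-up value):

	for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		/* Periodically drop the lock so the datapath and the
		 * scheduler can make progress.
		 */
		if (!(pfn % SCAN_BATCH)) {
			spin_unlock_bh(&pool->destroy_lock);
			cond_resched();
			spin_lock_bh(&pool->destroy_lock);
		}

		if (!page || !page_count(page) ||
		    (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
		    page->pp != pool)
			continue;

		/* Recheck: page_pool_put_page() may have unmapped this
		 * page while destroy_lock was dropped above.
		 */
		if (!page_pool_get_dma_addr(page))
			continue;

		dma_unmap_page_attrs(pool->p.dev,
				     page_pool_get_dma_addr(page),
				     PAGE_SIZE << pool->p.order,
				     pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC |
				     DMA_ATTR_WEAK_ORDERING);
		page_pool_set_dma_addr(page, 0);
		unmapped++;
	}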

Also, I am not sure it is appropriate to stall the driver unbinding for up to
10 secs here on those large memory systems.

https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH

> 
> --Jesper
