Message-ID: <ac728cc1-2ccb-4207-ae11-527a3ed8fbb6@kernel.org>
Date: Mon, 25 Nov 2024 16:25:00 +0100
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Yunsheng Lin <linyunsheng@...wei.com>, davem@...emloft.net,
kuba@...nel.org, pabeni@...hat.com
Cc: liuyonglong@...wei.com, fanghaiqing@...wei.com, zhangkun09@...wei.com,
Robin Murphy <robin.murphy@....com>,
Alexander Duyck <alexander.duyck@...il.com>, IOMMU <iommu@...ts.linux.dev>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Eric Dumazet <edumazet@...gle.com>, Simon Horman <horms@...nel.org>,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has
already unbound
On 21/11/2024 09.03, Yunsheng Lin wrote:
> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
>
> ...
>
>>> /* No more consumers should exist, but producers could still
>>> * be in-flight.
>>> @@ -1119,6 +1134,58 @@ void page_pool_disable_direct_recycling(struct page_pool *pool)
>>> }
>>> EXPORT_SYMBOL(page_pool_disable_direct_recycling);
>>> +static void page_pool_inflight_unmap(struct page_pool *pool)
>>> +{
>>> + unsigned int unmapped = 0;
>>> + struct zone *zone;
>>> + int inflight;
>>> +
>>> + if (!pool->dma_map || pool->mp_priv)
>>> + return;
>>> +
>>> + get_online_mems();
>>> + spin_lock_bh(&pool->destroy_lock);
>>> +
>>> + inflight = page_pool_inflight(pool, false);
>>> + for_each_populated_zone(zone) {
>>> + unsigned long end_pfn = zone_end_pfn(zone);
>>> + unsigned long pfn;
>>> +
>>> + for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
>>> + struct page *page = pfn_to_online_page(pfn);
>>> +
>>> + if (!page || !page_count(page) ||
>>> + (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
>>> + page->pp != pool)
>>> + continue;
>>> +
>>> + dma_unmap_page_attrs(pool->p.dev,
>>> + page_pool_get_dma_addr(page),
>>> + PAGE_SIZE << pool->p.order,
>>> + pool->p.dma_dir,
>>> + DMA_ATTR_SKIP_CPU_SYNC |
>>> + DMA_ATTR_WEAK_ORDERING);
>>> + page_pool_set_dma_addr(page, 0);
>>> +
>>
>> I feel this belongs in a helper function call.
>>
>> Previously we had a function called: page_pool_release_page().
>
> Previously I was going to reuse __page_pool_release_page_dma() for the
> above, but it seems __page_pool_release_page_dma() duplicates the
> pool->dma_map check for each page, as page_pool_inflight_unmap() already
> checks that before calling __page_pool_release_page_dma().
>
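Right, a variant without the pool->dma_map check could be factored out,
something like this rough sketch (the helper name is just a placeholder;
the body mirrors the unmap hunk from your patch):

	/* Unmap one page; caller has already verified pool->dma_map. */
	static void __page_pool_dma_unmap_page(struct page_pool *pool,
					       struct page *page)
	{
		dma_unmap_page_attrs(pool->p.dev,
				     page_pool_get_dma_addr(page),
				     PAGE_SIZE << pool->p.order,
				     pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC |
				     DMA_ATTR_WEAK_ORDERING);
		page_pool_set_dma_addr(page, 0);
	}

Then both __page_pool_release_page_dma() and your scan loop could call
it, avoiding the duplicated check.
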
>>
>> This was used to convert the page into a normal page again, releasing
>> page_pool dependencies. It was used when a packet transitioned into the
>> netstack, but it was removed when we decided it was safe to let the
>> netstack return pp pages.
>>
>> Removed in commits:
>> - 535b9c61bdef ("net: page_pool: hide page_pool_release_page()")
>> - 07e0c7d3179d ("net: page_pool: merge page_pool_release_page() with page_pool_return_page()")
>>
>>
>>> + unmapped++;
>>> +
>>> + /* Skip scanning all pages when debug is disabled */
>>> + if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
>>> + inflight == unmapped)
>>> + goto out;
>>> + }
>>> + }
>>> +
>>> +out:
>>> + WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
>>> + "page_pool(%u): unmapped(%u) != inflight pages(%d)\n",
>>> + pool->user.id, unmapped, inflight);
>>> +
>>> + pool->dma_map = false;
>>> + spin_unlock_bh(&pool->destroy_lock);
>>> + put_online_mems();
>>> +}
>>> +
>>> void page_pool_destroy(struct page_pool *pool)
>>> {
>>> if (!pool)
>>> @@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
>>> */
>>> synchronize_rcu();
>>> + page_pool_inflight_unmap(pool);
>>> +
>>
>> Reaching here means we have detected in-flight packets/pages.
>>
>> In "page_pool_inflight_unmap" we scan and find those in-flight pages to
>> DMA unmap them. Then below we wait for these in-flight pages again.
>> Why don't we just "release" (page_pool_release_page) those in-flight
>> pages from the page_pool when we find them during scanning?
>>
>> If we do so, we can hopefully remove the periodic checking code below.
>
> I thought about that too, but it means more complicated work than just
> calling page_pool_release_page(), as page->pp_ref_count needs to be
> converted into page->_refcount for the above to work. It seems hard to
> do that with minimal performance degradation, given the racing against
> page_pool_put_page() being called concurrently.
>
Maybe we can have a design that avoids/reduces concurrency. Can we
convert the suggested pool->destroy_lock into an atomic?
(Doing an *atomic* READ in page_pool_return_page() should be fast if we
keep this cache line in the (cache-coherence) Shared state.)

In your new/proposed page_pool_return_page(), when we see that
"destroy_cnt" (now an atomic READ) is bigger than zero, we can do
nothing (or maybe we need to decrement the page refcount?), as we know
the destroy code will take care of "releasing" the pages from the page
pool.

Once a page is released from a page pool it becomes a normal page that
adheres to normal page refcounting. That is how it worked before with
page_pool_release_page(). The later extensions with page fragment
support and devmem might have complicated this code path.
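
Something along these lines, as a rough (untested) sketch; it assumes
destroy_cnt becomes an atomic_t, and that a plain put_page() is the
right action for the non-frag case:

	static void page_pool_return_page(struct page_pool *pool,
					  netmem_ref netmem)
	{
		/* Atomic READ: cheap while the cache line stays Shared */
		if (unlikely(atomic_read(&pool->destroy_cnt) > 0)) {
			/* Destroy phase owns unmapping + "release"; from
			 * here on treat it as a normal refcounted page.
			 */
			put_page(netmem_to_page(netmem));
			return;
		}

		__page_pool_return_page(pool, netmem);
	}
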
>>
>>> page_pool_detached(pool);
>>> pool->defer_start = jiffies;
>>> pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
>>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>>> /* Flush pool alloc cache, as refill will check NUMA node */
>>> while (pool->alloc.count) {
>>> netmem = pool->alloc.cache[--pool->alloc.count];
>>> - page_pool_return_page(pool, netmem);
>>> + __page_pool_return_page(pool, netmem);
>>> }
>>> }
>>> EXPORT_SYMBOL(page_pool_update_nid);
>>
>> Thanks for continuing to work on this :-)
>
> I am not sure how scalable the scanning is going to be if the memory
> size becomes bigger, which is one of the reasons I posted this version
> as an RFC.
>
> From some quick searching here, it seems there might be servers with a
> max RAM capacity of 12.3TB, which means the scanning might take up to
> about 10 secs on those systems. The spin_lock is used to avoid
> concurrency, and as the page_pool_put_page() API might be called from
> softirq context, that could mean spinning for up to 12 secs in softirq
> context.
>
> And it seems hard to call cond_resched() when the scanning and
> unmapping take a lot of time, as page_pool_put_page() might be called
> concurrently while pool->destroy_lock is released, which might mean
> page_pool_get_dma_addr() needs to be checked for each page to decide
> whether the mapping is already done.
>
As explained above, once the "destroy" phase has started, I hope we can
avoid making concurrently returned pages wait on a lock, as IMHO it
will be problematic to stall the page_pool_return_page() call for this
long.

The take down ("destroy") of a page pool is always initiated from
userspace, so it is possible to call cond_resched() from that context.
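
E.g. the pfn scan loop could yield periodically, something like this
sketch (assuming the atomic destroy flag above keeps concurrent returns
off the lock, so sleeping here is safe; the stride is arbitrary):

	for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		/* Process context (userspace-initiated destroy), so we
		 * can yield; every 128K pfns ~= 512MB of 4K pages.
		 */
		if ((pfn & 0x1ffff) == 0)
			cond_resched();

		if (!page || !page_count(page) ||
		    (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
		    page->pp != pool)
			continue;
		/* ... unmap as in your patch ... */
	}
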
--Jesper