Message-ID: <ac728cc1-2ccb-4207-ae11-527a3ed8fbb6@kernel.org>
Date: Mon, 25 Nov 2024 16:25:00 +0100
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Yunsheng Lin <linyunsheng@...wei.com>, davem@...emloft.net,
kuba@...nel.org, pabeni@...hat.com
Cc: liuyonglong@...wei.com, fanghaiqing@...wei.com, zhangkun09@...wei.com,
Robin Murphy <robin.murphy@....com>,
Alexander Duyck <alexander.duyck@...il.com>, IOMMU <iommu@...ts.linux.dev>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Eric Dumazet <edumazet@...gle.com>, Simon Horman <horms@...nel.org>,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has
already unbound
On 21/11/2024 09.03, Yunsheng Lin wrote:
> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
>
> ...
>
>>> /* No more consumers should exist, but producers could still
>>> * be in-flight.
>>> @@ -1119,6 +1134,58 @@ void page_pool_disable_direct_recycling(struct page_pool *pool)
>>> }
>>> EXPORT_SYMBOL(page_pool_disable_direct_recycling);
>>> +static void page_pool_inflight_unmap(struct page_pool *pool)
>>> +{
>>> + unsigned int unmapped = 0;
>>> + struct zone *zone;
>>> + int inflight;
>>> +
>>> + if (!pool->dma_map || pool->mp_priv)
>>> + return;
>>> +
>>> + get_online_mems();
>>> + spin_lock_bh(&pool->destroy_lock);
>>> +
>>> + inflight = page_pool_inflight(pool, false);
>>> + for_each_populated_zone(zone) {
>>> + unsigned long end_pfn = zone_end_pfn(zone);
>>> + unsigned long pfn;
>>> +
>>> + for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
>>> + struct page *page = pfn_to_online_page(pfn);
>>> +
>>> + if (!page || !page_count(page) ||
>>> + (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
>>> + page->pp != pool)
>>> + continue;
>>> +
>>> + dma_unmap_page_attrs(pool->p.dev,
>>> + page_pool_get_dma_addr(page),
>>> + PAGE_SIZE << pool->p.order,
>>> + pool->p.dma_dir,
>>> + DMA_ATTR_SKIP_CPU_SYNC |
>>> + DMA_ATTR_WEAK_ORDERING);
>>> + page_pool_set_dma_addr(page, 0);
>>> +
>>
>> I feel this belongs in a helper function call.
>>
>> Previously we had a function called: page_pool_release_page().
>
> Previously I was going to reuse __page_pool_release_page_dma() for the
> above, but it seems __page_pool_release_page_dma() duplicates the
> pool->dma_map check for each page, as page_pool_inflight_unmap() already
> checks that before calling __page_pool_release_page_dma().
>
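Right, a variant without the pool->dma_map check could be factored out,
something like this rough sketch (the helper name is just a placeholder;
the body mirrors the unmap hunk from your patch):

	/* Unmap one page; caller has already verified pool->dma_map. */
	static void __page_pool_dma_unmap_page(struct page_pool *pool,
					       struct page *page)
	{
		dma_unmap_page_attrs(pool->p.dev,
				     page_pool_get_dma_addr(page),
				     PAGE_SIZE << pool->p.order,
				     pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC |
				     DMA_ATTR_WEAK_ORDERING);
		page_pool_set_dma_addr(page, 0);
	}

Then both __page_pool_release_page_dma() and your scan loop could call
it, avoiding the duplicated check.
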
>>
>> This was used to convert the page into a normal page again, releasing
>> page_pool dependencies. It was used when a packet transitioned into the
>> netstack, but it was removed when we decided it was safe to let the
>> netstack return pp pages.
>>
>> Removed in commits:
>> - 535b9c61bdef ("net: page_pool: hide page_pool_release_page()")
>> - 07e0c7d3179d ("net: page_pool: merge page_pool_release_page() with page_pool_return_page()")
>>
>>
>>> + unmapped++;
>>> +
>>> + /* Skip scanning all pages when debug is disabled */
>>> + if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
>>> + inflight == unmapped)
>>> + goto out;
>>> + }
>>> + }
>>> +
>>> +out:
>>> + WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
>>> + "page_pool(%u): unmapped(%u) != inflight pages(%d)\n",
>>> + pool->user.id, unmapped, inflight);
>>> +
>>> + pool->dma_map = false;
>>> + spin_unlock_bh(&pool->destroy_lock);
>>> + put_online_mems();
>>> +}
>>> +
>>> void page_pool_destroy(struct page_pool *pool)
>>> {
>>> if (!pool)
>>> @@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
>>> */
>>> synchronize_rcu();
>>> + page_pool_inflight_unmap(pool);
>>> +
>>
>> Reaching here means we have detected in-flight packets/pages.
>>
>> In "page_pool_inflight_unmap" we scan and find those in-flight pages to
>> DMA unmap them. Then below we wait for these in-flight pages again.
>> Why don't we just "release" (page_pool_release_page) those in-flight
>> pages from the page_pool when we find them during scanning?
>>
>> If we do so, we can hopefully remove the periodic checking code below.
>
> I thought about that too, but it means more complicated work than just
> calling page_pool_release_page(), as page->pp_ref_count needs to be
> converted into page->_refcount for the above to work. It seems hard to
> do that with minimal performance degradation, given the racing against
> page_pool_put_page() being called concurrently.
>
Maybe we can have a design that avoids/reduces concurrency. Can we
convert the suggested pool->destroy_lock into an atomic?
(Doing an *atomic* READ in page_pool_return_page() should be fast if we
keep this cache line in the (cache-coherence) Shared state.)

In your new/proposed page_pool_return_page(), when we see that
"destroy_cnt" (now an atomic READ) is bigger than zero, we can do
nothing (or maybe we need to decrement the page refcount?), as we know
the destroy code will take care of "releasing" the pages from the page
pool.

Once a page is released from a page pool it becomes a normal page that
adheres to normal page refcounting. That is how it worked before with
page_pool_release_page(). The later extensions with page fragment
support and devmem might have complicated this code path.
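
Something along these lines, as a rough (untested) sketch; it assumes
destroy_cnt becomes an atomic_t, and that a plain put_page() is the
right action for the non-frag case:

	static void page_pool_return_page(struct page_pool *pool,
					  netmem_ref netmem)
	{
		/* Atomic READ: cheap while the cache line stays Shared */
		if (unlikely(atomic_read(&pool->destroy_cnt) > 0)) {
			/* Destroy phase owns unmapping + "release"; from
			 * here on treat it as a normal refcounted page.
			 */
			put_page(netmem_to_page(netmem));
			return;
		}

		__page_pool_return_page(pool, netmem);
	}
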
>>
>>> page_pool_detached(pool);
>>> pool->defer_start = jiffies;
>>> pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
>>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>>> /* Flush pool alloc cache, as refill will check NUMA node */
>>> while (pool->alloc.count) {
>>> netmem = pool->alloc.cache[--pool->alloc.count];
>>> - page_pool_return_page(pool, netmem);
>>> + __page_pool_return_page(pool, netmem);
>>> }
>>> }
>>> EXPORT_SYMBOL(page_pool_update_nid);
>>
>> Thanks for continuing to work on this :-)
>
> I am not sure how scalable the scanning is going to be if the memory
> size becomes bigger, which is one of the reasons I posted this version
> as an RFC.
>
> From some quick searching here, it seems there might be servers with a
> max RAM capacity of 12.3TB, which means the scanning might take up to
> about 10 secs on those systems. The spin_lock is used to avoid
> concurrency, and as the page_pool_put_page() API might be called from
> softirq context, that could mean spinning for up to 12 secs in softirq
> context.
>
> And it seems hard to call cond_resched() when the scanning and
> unmapping take a lot of time, as page_pool_put_page() might be called
> concurrently while pool->destroy_lock is released, which might mean
> page_pool_get_dma_addr() needs to be checked for each page to decide
> whether the mapping is already done.
>
As explained above, once the "destroy" phase has started, I hope we can
avoid making concurrently returned pages wait on a lock, as IMHO it
will be problematic to stall the page_pool_return_page() call for this
long.

The take down ("destroy") of a page pool is always initiated from
userspace, so it is possible to call cond_resched() from that context.
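
E.g. the pfn scan loop could yield periodically, something like this
sketch (assuming the atomic destroy flag above keeps concurrent returns
off the lock, so sleeping here is safe; the stride is arbitrary):

	for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		/* Process context (userspace-initiated destroy), so we
		 * can yield; every 128K pfns ~= 512MB of 4K pages.
		 */
		if ((pfn & 0x1ffff) == 0)
			cond_resched();

		if (!page || !page_count(page) ||
		    (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
		    page->pp != pool)
			continue;
		/* ... unmap as in your patch ... */
	}
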
--Jesper