Message-ID: <CAHS8izM+sK=48gfa3gRNffu=T6t6-2vaS60QvH79zFA3gSDv9g@mail.gmail.com>
Date: Tue, 26 Nov 2024 13:51:11 -0800
From: Mina Almasry <almasrymina@...gle.com>
To: Yunsheng Lin <linyunsheng@...wei.com>
Cc: Jesper Dangaard Brouer <hawk@...nel.org>, davem@...emloft.net, kuba@...nel.org, 
	pabeni@...hat.com, liuyonglong@...wei.com, fanghaiqing@...wei.com, 
	zhangkun09@...wei.com, Robin Murphy <robin.murphy@....com>, 
	Alexander Duyck <alexander.duyck@...il.com>, IOMMU <iommu@...ts.linux.dev>, 
	Ilias Apalodimas <ilias.apalodimas@...aro.org>, Eric Dumazet <edumazet@...gle.com>, 
	Simon Horman <horms@...nel.org>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has
 already unbound

On Thu, Nov 21, 2024 at 12:03 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>
> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
> >
> >>       page_pool_detached(pool);
> >>       pool->defer_start = jiffies;
> >>       pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> >> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
> >>       /* Flush pool alloc cache, as refill will check NUMA node */
> >>       while (pool->alloc.count) {
> >>           netmem = pool->alloc.cache[--pool->alloc.count];
> >> -        page_pool_return_page(pool, netmem);
> >> +        __page_pool_return_page(pool, netmem);
> >>       }
> >>   }
> >>   EXPORT_SYMBOL(page_pool_update_nid);
> >
> > Thanks for continuing to work on this :-)
>
> I am not sure how scalable the scanning is going to be as the memory size
> becomes bigger, which is one of the reasons I was posting this version as
> an RFC.
>
> From some quick searching, it seems there are servers with a max RAM
> capacity of 12.3TB, which means the scanning might take up to about 10 secs
> on those systems. The spin_lock is used to avoid concurrency, as the
> page_pool_put_page() API might be called from softirq context, which could
> mean spinning for up to 10 secs in softirq context.
>
> And it seems hard to call cond_resched() when the scanning and unmapping
> take a lot of time, as page_pool_put_page() might be called concurrently
> once pool->destroy_lock is released, which means page_pool_get_dma_addr()
> needs to be checked for each page to decide whether the unmapping has
> already been done.
>
> Also, I am not sure it is appropriate to stall the driver unbind for up to
> 10 secs here on those large memory systems.
>
> https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
>
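
As a rough sanity check on the numbers above (assuming 4KiB base pages):
12.3TB / 4KiB comes out to roughly 3 billion struct pages, so a 10 sec
scan works out to ~3ns per page, which seems about right for a cache-cold
walk of the whole physmap.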

FWIW I'm also concerned about looping over all memory on the system.
In addition to the performance, I think (but am not sure) that
CONFIG_MEMORY_HOTPLUG may break such a loop, as memory may appear or
disappear concurrently. Even if not, the CPU cost of this may be
significant. I'm imagining the possibility of having many page_pools
allocated on the system for many hardware queues (and maybe multiple
pp's per queue for applications like devmem TCP), with each pp looping
over the entire xTB of memory on page_pool_destroy()...

My 2 cents here is that a more reasonable approach is to have the pp
track all pages it has dma-mapped, without the problems seen in the
previous iterations of this patch. Roughly (a sketch follows the list):

1. When we dma-map a page, we add it to some pp->dma_mapped data
structure (maybe an xarray or rculist).
2. When we dma-unmap a page, we remove it from pp->dma_mapped.
3. When we destroy the pp, we traverse pp->dma_mapped and unmap any
pages still there.
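
Something like this for (1) and (2), purely as an illustrative sketch:
pool->dma_mapped (an xarray set up with xa_init_flags(..., XA_FLAGS_ALLOC1)
at pool create), the page->dma_index back-reference, and both helper names
are made up here, not existing page_pool API:

static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
{
	dma_addr_t dma;
	u32 id;

	dma = dma_map_page_attrs(pool->p.dev, page, 0,
				 PAGE_SIZE << pool->p.order,
				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
	if (dma_mapping_error(pool->p.dev, dma))
		return false;

	/* (1) remember the mapped page so destroy can still find it;
	 * xa_alloc() takes the xa_lock internally.
	 */
	if (xa_alloc(&pool->dma_mapped, &id, page, xa_limit_32b,
		     GFP_ATOMIC)) {
		dma_unmap_page_attrs(pool->p.dev, dma,
				     PAGE_SIZE << pool->p.order,
				     pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC);
		return false;
	}

	page->dma_index = id;	/* hypothetical back-reference */
	page_pool_set_dma_addr(page, dma);
	return true;
}

static void page_pool_dma_unmap(struct page_pool *pool, struct page *page)
{
	/* (2) claim the slot first; if the destroy path in (3) already
	 * took it, the page has been unmapped over there.
	 */
	if (xa_erase(&pool->dma_mapped, page->dma_index) != page)
		return;

	dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
			     DMA_ATTR_SKIP_CPU_SYNC);
	page_pool_set_dma_addr(page, 0);
}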

I haven't looked deeply, but with the right data structure we may be
able to synchronize 1, 2, and 3 without any additional locks. From a
quick skim it seems rculist or xarray may be able to do this with no
extra locking; for step 3, something like the sketch below.
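
Again a hypothetical sketch, leaning only on the xa_lock the xarray
already takes internally; the xa_cmpxchg() pairs with the xa_erase()
claim in (2) so that exactly one side does the unmapping:

static void page_pool_unmap_pending(struct page_pool *pool)
{
	struct page *page;
	unsigned long id;

	/* (3) unmap whatever is still mapped before the DMA device goes
	 * away; assumes no new mappings are created once destroy has
	 * started. If a concurrent (2) claimed the slot first,
	 * xa_cmpxchg() returns NULL and we skip the page.
	 */
	xa_for_each(&pool->dma_mapped, id, page) {
		if (xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0) != page)
			continue;
		dma_unmap_page_attrs(pool->p.dev,
				     page_pool_get_dma_addr(page),
				     PAGE_SIZE << pool->p.order,
				     pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC);
		page_pool_set_dma_addr(page, 0);
	}
	xa_destroy(&pool->dma_mapped);
}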

As stated in previous iterations of this approach, we should not be
putting any hard limit on the amount of memory the pp can allocate,
and we should not have to mess with the page->pp entry in struct page.

-- 
Thanks,
Mina
