[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAKgT0Uc-SDHsGkgmLeAuo5GLE0H43i3h7mmzG88BQojfCoQGGA@mail.gmail.com>
Date: Tue, 26 Nov 2024 15:53:53 -0800
From: Alexander Duyck <alexander.duyck@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: Yunsheng Lin <linyunsheng@...wei.com>, Jesper Dangaard Brouer <hawk@...nel.org>, davem@...emloft.net,
kuba@...nel.org, pabeni@...hat.com, liuyonglong@...wei.com,
fanghaiqing@...wei.com, zhangkun09@...wei.com,
Robin Murphy <robin.murphy@....com>, IOMMU <iommu@...ts.linux.dev>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>, Eric Dumazet <edumazet@...gle.com>,
Simon Horman <horms@...nel.org>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has
already unbound
On Tue, Nov 26, 2024 at 1:51 PM Mina Almasry <almasrymina@...gle.com> wrote:
>
> On Thu, Nov 21, 2024 at 12:03 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
> >
> > On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
> > >
> > >> page_pool_detached(pool);
> > >> pool->defer_start = jiffies;
> > >> pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
> > >> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
> > >> /* Flush pool alloc cache, as refill will check NUMA node */
> > >> while (pool->alloc.count) {
> > >> netmem = pool->alloc.cache[--pool->alloc.count];
> > >> - page_pool_return_page(pool, netmem);
> > >> + __page_pool_return_page(pool, netmem);
> > >> }
> > >> }
> > >> EXPORT_SYMBOL(page_pool_update_nid);
> > >
> > > Thanks for continuing to work on this :-)
> >
> > I am not sure how scalable the scanning is going to be if the memory size became
> > bigger, which is one of the reason I was posting it as RFC for this version.
> >
> > For some quick searching here, it seems there might be server with max ram capacity
> > of 12.3TB, which means the scanning might take up to about 10 secs for those systems.
> > The spin_lock is used to avoid concurrency as the page_pool_put_page() API might be
> > called from the softirq context, which might mean there might be spinning of 12 secs
> > in the softirq context.
> >
> > And it seems hard to call cond_resched() when the scanning and unmapping takes a lot
> > of time as page_pool_put_page() might be called concurrently when pool->destroy_lock
> > is released, which might means page_pool_get_dma_addr() need to be checked to decide
> > if the mapping is already done or not for each page.
> >
> > Also, I am not sure it is appropriate to stall the driver unbound up to 10 secs here
> > for those large memory systems.
> >
> > https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
> >
>
> FWIW I'm also concerned about the looping of all memory on the system.
> In addition to the performance, I think (but not sure), that
> CONFIG_MEMORY_HOTPLUG may mess such a loop as memory may appear or
> disappear concurrently. Even if not, the CPU cost of this may be
> significant. I'm imagining the possibility of having many page_pools
> allocated on the system for many hardware queues, (and maybe multiple
> pp's per queue for applications like devmem TCP), and each pp looping
> over the entire xTB memory on page_pool_destroy()...
>
> My 2 cents here is that a more reasonable approach is to have the pp
> track all pages it has dma-mapped, without the problems in the
> previous iterations of this patch:
>
> 1. When we dma-map a page, we add it to some pp->dma_mapped data
> structure (maybe xarray or rculist).
> 2. When we dma-unmap a page, we remove it from pp->dma_mapped.
> 3 When we destroy the pp, we traverse pp->dma_mapped and unmap all the
> pages there.
The thing is this should be a very rare event as it should apply only
when a device is removed and still has pages outstanding shouldn't it?
The problem is that maintaining a list of in-flight DMA pages will be
very costly and will make the use of page pool expensive enough that I
would worry it might be considered less than useful. Once we add too
much overhead the caching of the DMA address doesn't gain us much on
most systems in that case.
> I haven't looked deeply, but with the right data structure we may be
> able to synchronize 1, 2, and 3 without any additional locks. From a
> quick skim it seems maybe rculist and xarray can do this without
> additional locks, maybe.
>
> Like stated in the previous iterations of this approach, we should not
> be putting any hard limit on the amount of memory the pp can allocate,
> and we should not have to mess with the page->pp entry in struct page.
I agree with you on the fact that we shouldn't be setting any sort of
limit. The current approach to doing the unmapping is more-or-less the
brute force way of doing it working around the DMA api. I wonder if we
couldn't look at working with it instead and see if there wouldn't be
some way for us to reduce the overhead instead of having to do the
full scan of the page table.
One thought in that regard though would be to see if there were a way
to have the DMA API itself provide some of that info. I know the DMA
API should be storing some of that data for the mapping as we have to
go through and invalidate it if it is stored.
Another alternative would be to see if we have the option to just
invalidate the DMA side of things entirely for the device. Essentially
unregister the device from the IOMMU instead of the mappings. If that
is an option then we could look at leaving the page pool in a state
where it would essentially claim it no longer has to do the DMA unmap
operations and is just freeing the remaining lingering pages.
Powered by blists - more mailing lists