Message-ID: <CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com>
Date: Wed, 27 Nov 2024 11:39:53 -0800
From: Mina Almasry <almasrymina@...gle.com>
To: Yunsheng Lin <linyunsheng@...wei.com>
Cc: Alexander Duyck <alexander.duyck@...il.com>, Jesper Dangaard Brouer <hawk@...nel.org>, davem@...emloft.net,
kuba@...nel.org, pabeni@...hat.com, liuyonglong@...wei.com,
fanghaiqing@...wei.com, zhangkun09@...wei.com,
Robin Murphy <robin.murphy@....com>, IOMMU <iommu@...ts.linux.dev>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>, Eric Dumazet <edumazet@...gle.com>,
Simon Horman <horms@...nel.org>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has
already unbound
On Wed, Nov 27, 2024 at 1:35 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>
> On 2024/11/27 7:53, Alexander Duyck wrote:
> > On Tue, Nov 26, 2024 at 1:51 PM Mina Almasry <almasrymina@...gle.com> wrote:
> >>
> >> On Thu, Nov 21, 2024 at 12:03 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
> >>>
> >>> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
> >>>>
> >>>>> page_pool_detached(pool);
> >>>>> pool->defer_start = jiffies;
> >>>>> pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
> >>>>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
> >>>>> /* Flush pool alloc cache, as refill will check NUMA node */
> >>>>> while (pool->alloc.count) {
> >>>>> netmem = pool->alloc.cache[--pool->alloc.count];
> >>>>> - page_pool_return_page(pool, netmem);
> >>>>> + __page_pool_return_page(pool, netmem);
> >>>>> }
> >>>>> }
> >>>>> EXPORT_SYMBOL(page_pool_update_nid);
> >>>>
> >>>> Thanks for continuing to work on this :-)
> >>>
> >>> I am not sure how scalable the scanning is going to be as memory sizes grow,
> >>> which is one of the reasons I posted this version as an RFC.
> >>>
> >>> From some quick searching, there seem to be servers with a maximum RAM capacity
> >>> of 12.3TB, which means the scanning might take up to about 10 secs on those systems.
> >>> The spin_lock is used to avoid concurrency, as the page_pool_put_page() API might be
> >>> called from softirq context, which could mean spinning for up to 12 secs in softirq
> >>> context.
> >>>
> >>> It also seems hard to call cond_resched() when the scanning and unmapping take a long
> >>> time, because page_pool_put_page() might be called concurrently once pool->destroy_lock
> >>> is released, which means page_pool_get_dma_addr() needs to be checked for each page to
> >>> decide whether the mapping has already been done.
> >>>
> >>> Also, I am not sure it is appropriate to stall the driver unbind for up to 10 secs here
> >>> on those large-memory systems.
> >>>
> >>> https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
> >>>
> >>
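To put rough numbers on that estimate: assuming 4KiB base pages, 12.3TB is
about 3 billion struct pages, so even at only ~3ns per page inspected the
walk already lands in the ~10 second range mentioned above, before counting
any actual unmapping work.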
> >> FWIW I'm also concerned about looping over all memory on the system.
> >> In addition to the performance, I think (but am not sure) that
> >> CONFIG_MEMORY_HOTPLUG may mess up such a loop, as memory may appear or
> >> disappear concurrently. Even if not, the CPU cost of this may be
> >> significant. I'm imagining the possibility of having many page_pools
> >> allocated on the system for many hardware queues (and maybe multiple
> >> pp's per queue for applications like devmem TCP), each pp looping
> >> over the entire xTB of memory on page_pool_destroy()...
> >>
> >> My 2 cents here is that a more reasonable approach is to have the pp
> >> track all pages it has dma-mapped, without the problems in the
> >> previous iterations of this patch:
> >>
> >> 1. When we dma-map a page, we add it to some pp->dma_mapped data
> >> structure (maybe xarray or rculist).
> >> 2. When we dma-unmap a page, we remove it from pp->dma_mapped.
> >> 3. When we destroy the pp, we traverse pp->dma_mapped and unmap all the
> >> pages there.
> >
> > The thing is, this should be a very rare event, as it should only apply
> > when a device is removed while it still has pages outstanding, shouldn't it?
> > The problem is that maintaining a list of in-flight DMA pages will be
> > very costly and will make the use of page_pool expensive enough that I
> > would worry it might be considered less than useful. Once we add too
> > much overhead, caching the DMA address doesn't gain us much on most
> > systems.
> >
> >> I haven't looked deeply, but with the right data structure we may be
> >> able to synchronize 1, 2, and 3 without any additional locks. From a
> >> quick skim, it seems rculist and xarray may be able to do this without
> >> additional locks.
>
> I am not sure how the above "right data structure without any additional
> locks" would work, but my feeling is that the issues mentioned in [1] will
> likely apply to it too.
>
> 1. https://lore.kernel.org/all/6233e2c3-3fea-4ed0-bdcc-9a625270da37@huawei.com/
>
I don't see how the issues called out in the above thread conflict with
what I'm proposing. In fact, I think Jesper's suggestion works
perfectly with what I'm proposing. Maybe I'm missing something.

We can use an atomic pool->destroy_count to synchronize steps 2 and 3,
i.e. if destroy_count > 0, we don't unmap the page in
__page_pool_release_page(), and instead count on page_pool_destroy()
unmapping all the pages in the pp->dma_mapped list.
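Roughly like this (again untested, reusing the invented helper names from
the sketch above, and glossing over the ordering details between the
destroy_count check and the traversal):

/* Sketch: pool->destroy_count is a new atomic_t, bumped before the
 * destroy path walks pool->dma_mapped.
 */
static void __page_pool_release_dma(struct page_pool *pool,
                                    struct page *page)
{
        dma_addr_t dma = page_pool_get_dma_addr(page);

        /* Once destroy has started, skip the unmap here and let
         * page_pool_destroy()'s traversal of pool->dma_mapped do it.
         */
        if (atomic_read(&pool->destroy_count))
                return;

        dma_unmap_page_attrs(pool->p.dev, dma,
                             PAGE_SIZE << pool->p.order,
                             pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
        __page_pool_untrack_mapping(pool, dma);
}

void page_pool_destroy(struct page_pool *pool)
{
        atomic_inc(&pool->destroy_count);

        /* ... existing destroy/disconnect logic ... */

        __page_pool_unmap_leftovers(pool);
}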
> >>
> >> As stated in the previous iterations of this approach, we should not
> >> be putting any hard limit on the amount of memory the pp can allocate,
> >> and we should not have to mess with the page->pp entry in struct page.
>
> It would be good to be more specific, with some pseudocode or an RFC, about
> how this can be done without 'messing' with the page->pp entry in struct
> page, if you consider the renaming to be messing with it.
>
Yeah, I don't think touching the page->pp entry is needed if you use
an xarray or rculist which hangs off the pp.
I'm already working on a few bug fixes and on devmem TCP TX, which a
few folks are waiting on, so I don't think I can look into this right
now, but I'll try. If this issue hasn't been resolved by the time I
get some bandwidth, sure, I'll take a stab at it.
--
Thanks,
Mina