linux-kernel - Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKgT0UfGmR9B7WBjANvZ9=dxbsWXDRgpaNAMJWGW4Uj4ueiHJg@mail.gmail.com>
Date: Wed, 27 Nov 2024 08:27:00 -0800
From: Alexander Duyck <alexander.duyck@...il.com>
To: Robin Murphy <robin.murphy@....com>
Cc: Yunsheng Lin <linyunsheng@...wei.com>, Mina Almasry <almasrymina@...gle.com>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com, 
	liuyonglong@...wei.com, fanghaiqing@...wei.com, zhangkun09@...wei.com, 
	IOMMU <iommu@...ts.linux.dev>, Ilias Apalodimas <ilias.apalodimas@...aro.org>, 
	Eric Dumazet <edumazet@...gle.com>, Simon Horman <horms@...nel.org>, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has
 already unbound

On Wed, Nov 27, 2024 at 7:31 AM Robin Murphy <robin.murphy@....com> wrote:
>
> On 27/11/2024 9:35 am, Yunsheng Lin wrote:
> > On 2024/11/27 7:53, Alexander Duyck wrote:
> >> On Tue, Nov 26, 2024 at 1:51 PM Mina Almasry <almasrymina@...gle.com> wrote:
> >>>
> >>> On Thu, Nov 21, 2024 at 12:03 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
> >>>>
> >>>> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
> >>>>>
> >>>>>>        page_pool_detached(pool);
> >>>>>>        pool->defer_start = jiffies;
> >>>>>>        pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> >>>>>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
> >>>>>>        /* Flush pool alloc cache, as refill will check NUMA node */
> >>>>>>        while (pool->alloc.count) {
> >>>>>>            netmem = pool->alloc.cache[--pool->alloc.count];
> >>>>>> -        page_pool_return_page(pool, netmem);
> >>>>>> +        __page_pool_return_page(pool, netmem);
> >>>>>>        }
> >>>>>>    }
> >>>>>>    EXPORT_SYMBOL(page_pool_update_nid);
> >>>>>
> >>>>> Thanks for continuing to work on this :-)
> >>>>
> >>>> I am not sure how scalable the scanning is going to be if the memory size became
> >>>> bigger, which is one of the reason I was posting it as RFC for this version.
> >>>>
> >>>> For some quick searching here, it seems there might be server with max ram capacity
> >>>> of 12.3TB, which means the scanning might take up to about 10 secs for those systems.
> >>>> The spin_lock is used to avoid concurrency as the page_pool_put_page() API might be
> >>>> called from the softirq context, which might mean there might be spinning of 12 secs
> >>>> in the softirq context.
> >>>>
> >>>> And it seems hard to call cond_resched() when the scanning and unmapping takes a lot
> >>>> of time as page_pool_put_page() might be called concurrently when pool->destroy_lock
> >>>> is released, which might means page_pool_get_dma_addr() need to be checked to decide
> >>>> if the mapping is already done or not for each page.
> >>>>
> >>>> Also, I am not sure it is appropriate to stall the driver unbound up to 10 secs here
> >>>> for those large memory systems.
> >>>>
> >>>> https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
> >>>>
> >>>
> >>> FWIW I'm also concerned about the looping of all memory on the system.
> >>> In addition to the performance, I think (but not sure), that
> >>> CONFIG_MEMORY_HOTPLUG may mess such a loop as memory may appear or
> >>> disappear concurrently. Even if not, the CPU cost of this may be
> >>> significant. I'm imagining the possibility of having many page_pools
> >>> allocated on the system for many hardware queues, (and maybe multiple
> >>> pp's per queue for applications like devmem TCP), and each pp looping
> >>> over the entire xTB memory on page_pool_destroy()...
> >>>
> >>> My 2 cents here is that a more reasonable approach is to have the pp
> >>> track all pages it has dma-mapped, without the problems in the
> >>> previous iterations of this patch:
> >>>
> >>> 1. When we dma-map a page, we add it to some pp->dma_mapped data
> >>> structure (maybe xarray or rculist).
> >>> 2. When we dma-unmap a page, we remove it from pp->dma_mapped.
> >>> 3 When we destroy the pp, we traverse pp->dma_mapped and unmap all the
> >>> pages there.
> >>
> >> The thing is this should be a very rare event as it should apply only
> >> when a device is removed and still has pages outstanding shouldn't it?
> >> The problem is that maintaining a list of in-flight DMA pages will be
> >> very costly and will make the use of page pool expensive enough that I
> >> would worry it might be considered less than useful. Once we add too
> >> much overhead the caching of the DMA address doesn't gain us much on
> >> most systems in that case.
> >>
> >>> I haven't looked deeply, but with the right data structure we may be
> >>> able to synchronize 1, 2, and 3 without any additional locks. From a
> >>> quick skim it seems maybe rculist and xarray can do this without
> >>> additional locks, maybe.
> >
> > I am not sure how the above right data structure without any additional
> > locks will work, but my feeling is that the issues mentioned in [1] will
> > likely apply to the above right data structure too.
> >
> > 1. https://lore.kernel.org/all/6233e2c3-3fea-4ed0-bdcc-9a625270da37@huawei.com/
> >
> >>>
> >>> Like stated in the previous iterations of this approach, we should not
> >>> be putting any hard limit on the amount of memory the pp can allocate,
> >>> and we should not have to mess with the page->pp entry in struct page.
> >
> > It would be good to be more specific about how it is done without 'messing'
> > with the page->pp entry in struct page using some pseudocode or RFC if you
> > call the renaming as messing.
> >
> >>
> >> I agree with you on the fact that we shouldn't be setting any sort of
> >> limit. The current approach to doing the unmapping is more-or-less the
> >> brute force way of doing it working around the DMA api. I wonder if we
> >> couldn't look at working with it instead and see if there wouldn't be
> >> some way for us to reduce the overhead instead of having to do the
> >> full scan of the page table.
> >>
> >> One thought in that regard though would be to see if there were a way
> >> to have the DMA API itself provide some of that info. I know the DMA
> >> API should be storing some of that data for the mapping as we have to
> >> go through and invalidate it if it is stored.
> >>
> >> Another alternative would be to see if we have the option to just
> >> invalidate the DMA side of things entirely for the device. Essentially
> >> unregister the device from the IOMMU instead of the mappings. If that
> >> is an option then we could look at leaving the page pool in a state
> >> where it would essentially claim it no longer has to do the DMA unmap
> >> operations and is just freeing the remaining lingering pages.
> >
> > If we are going to 'invalidate the DMA side of things entirely for the
> > device', synchronization from page_pool might just go to the DMA core as
> > concurrent calling for dma unmapping API and 'invalidating' operation still
> > exist. If the invalidating is a common feature, perhaps it makes sense to
> > do that in the DMA core, otherwise it might just add unnecessary overhead
> > for other callers of DMA API.
> >
> > As mentioned by Robin in [2], the DMA core seems to make a great deal of
> > effort to catch DMA API misuse in kernel/dma/debug.c, it seems doing the
> > above might invalidate some of the dma debug checking.
>
> Has nobody paused to consider *why* dma-debug is an optional feature
> with an explicit performance warning? If you're concerned about the
> impact of keeping track of DMA mappings within the confines of the
> page_pool usage model, do you really imagine it could somehow be cheaper
> to keep track of them at the generic DMA API level without the benefit
> of any assumptions at all?

I get what you are saying, but there are things about the internal
implementations of the individual DMA APIs that might make them much
easier to destroy mappings and essentially invalidate them. For
example if the system doesn't have an IOMMU there isn't much that
actually has to be retained. At most it might be a bitmap for the
SWIOTLB that would have to be retained per device and that could be
used to invalidate the mappings assuming the device has been wiped out
and is somehow actually using the SWIOTLB.

In the case of an IOMMU there are many who run with iommu=pt which
will identity map the entire system memory and then just hand out
individual physical addresses from that range. In reality that should
be almost as easy to handle as the non-iommu case so why shouldn't we
take advantage of that to clean up this use case?

> Yes, in principle we could add a set of "robust" DMA APIs which make
> sure racy sync calls are safe and offer a "clean up all my outstanding
> mappings" op, and they would offer approximately as terrible performance
> as the current streaming APIs with dma-debug enabled, because it would
> be little different from what dma-debug already does. The checks
> themselves aren't complicated; the generally-prohibitive cost lies in
> keeping track of mappings and allocations so that they *can* be checked
> internally.
>
> Whatever you think is hard to do in the page_pool code to fix that
> code's own behaviour, it's even harder to do from elsewhere with less
> information.

My general thought would be to see if there is anything we could
explore within the DMA API itself to optimize the handling for this
sort of bulk unmap request. If not we could fall back to an approach
that requires more overhead and invalidation of individual pages.

You could think of it like the approach that has been taken with
DEFINED_DMA_UNMAP_ADDR/LEN. Basically there are cases where this can
be done much more quickly and it is likely we can clean up large
swaths in one go. So why not expose a function that might be able to
take advantage of that for exception cases like this surprise device
removal.