Message-Id: <80e36f86-408d-48bf-8c7c-82c057f18cdb@www.fastmail.com>
Date:   Mon, 09 Aug 2021 21:57:52 +0200
From:   "Sven Peter" <sven@...npeter.dev>
To:     "Robin Murphy" <robin.murphy@....com>,
        iommu@...ts.linux-foundation.org
Cc:     "Arnd Bergmann" <arnd@...nel.org>, "Will Deacon" <will@...nel.org>,
        "Hector Martin" <marcan@...can.st>, linux-kernel@...r.kernel.org,
        "Alexander Graf" <graf@...zon.com>,
        "Mohamed Mediouni" <mohamed.mediouni@...amail.com>
Subject: Re: [RFC PATCH 2/3] iommu/dma-iommu: Support iovad->granule > PAGE_SIZE

Hi,

On Mon, Aug 9, 2021, at 20:37, Robin Murphy wrote:
> On 2021-08-07 09:41, Sven Peter wrote:
> > Hi,
> > 
> > Thanks a lot for the quick reply!
> > 
> > On Fri, Aug 6, 2021, at 20:04, Robin Murphy wrote:
> >> On 2021-08-06 16:55, Sven Peter via iommu wrote:
> >>> DMA IOMMU domains can support hardware where the IOMMU page size is
> >>> larger than the CPU page size.
> >>> Alignments need to be done with respect to both PAGE_SIZE and
> >>> iovad->granule. Additionally, the sg list optimization to use a single
> >>> IOVA allocation cannot be used in those cases since the physical
> >>> addresses will very likely not be aligned to the larger IOMMU page size.
> >>>
> >>> Signed-off-by: Sven Peter <sven@...npeter.dev>
> >>> ---
> >>>    drivers/iommu/dma-iommu.c | 87 ++++++++++++++++++++++++++++++++++-----
> >>>    1 file changed, 77 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> >>> index 6f0df629353f..e072d9030d9f 100644
> >>> --- a/drivers/iommu/dma-iommu.c
> >>> +++ b/drivers/iommu/dma-iommu.c
> >>> @@ -8,6 +8,7 @@
> >>>     * Copyright (C) 2000-2004 Russell King
> >>>     */
> >>>    
> >>> +#include <linux/align.h>
> >>>    #include <linux/acpi_iort.h>
> >>>    #include <linux/device.h>
> >>>    #include <linux/dma-map-ops.h>
> >>> @@ -51,6 +52,15 @@ struct iommu_dma_cookie {
> >>>    	struct iommu_domain		*fq_domain;
> >>>    };
> >>>    
> >>> +/* aligns size to CPU and IOMMU page size */
> >>> +static inline size_t iommu_page_align(struct device *dev, size_t size)
> >>> +{
> >>> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> >>> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> >>> +
> >>> +	return iova_align(&cookie->iovad, PAGE_ALIGN(size));
> >>> +}
> >>> +
> >>>    static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
> >>>    bool iommu_dma_forcedac __read_mostly;
> >>>    
> >>> @@ -647,6 +657,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
> >>>    /*
> >>>     * If size is less than PAGE_SIZE, then a full CPU page will be allocated,
> >>>     * but an IOMMU which supports smaller pages might not map the whole thing.
> >>> + * If the IOMMU page size is larger than the CPU page size, then the size
> >>> + * will be aligned to that granularity and some memory will be left unused.
> >>
> >> Why do we need to increase the actual memory allocation? The point here
> >> is that we allocate the smallest thing we can allocate and map the
> >> smallest thing we can map - I think that still works the "wrong" way
> >> round too, we should just need to start taking an IOVA offset into
> >> account as in dma_map_page() if we can no longer assume it's 0 for a CPU
> >> page. Sure we may expose some unrelated adjacent pages, but we'll
> >> already be doing that to excess for streaming DMA so whoop de do.
> > 
> > I agree for trusted devices, but untrusted ones (Thunderbolt, and depending on your
> > risk tolerance possibly even the broadcom wifi) might also end up calling this.
> 
> Oh, right, I hadn't considered actual untrusted device support at this 
> stage.


Me neither :-)
I did the alignment at first without thinking too much about it,
then read your reply, and only *then* realized that there are untrusted devices
for which this just happens to do the right thing (at the cost of wasting
memory for everyone else, but I'll fix that).

> 
> > For streaming DMA swiotlb will make sure that these won't see memory
> > they're not supposed to access.
> 
> I was slightly surprised to see that that does appear to work out OK, 
> but I guess SWIOTLB slots are already smaller than typical IOMMU pages, 
> so it falls out of that. Neat.
> 
> > But, at least as far as I understand it, no swiotlb is in the way to catch devices
> > that end up calling this function. That wasn't required because we used to get
> > PAGE_SIZE-aligned allocations here and every IOMMU so far could easily
> > map them without any spill-over.
> > But now we'll end up exposing three more unrelated pages if the allocation
> > is not increased.
> 
> Fair enough, but then why still waste memory in the (usual) cases where 
> it logically *isn't* necessary?

See above, I didn't even consider this case either.

I should be able to fix this by first allocating <= 3 pages to reach an
iovad->granule boundary (or to satisfy the full allocation length already),
adjusting the DMA offset forward for this first allocation, and then just
allocating iovad->granule-sized blocks for the rest of the requested size.

The current code already does the second part and I think I'll just need
to add the first alignment part.
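
Something along these lines is what I have in mind for the split; this is an
untested sketch only, and struct granule_split / compute_granule_split are
made-up names just for illustration:

/*
 * Example with 4K CPU pages and a 16K granule: a 24K request becomes an
 * 8K head mapped at the end of the first granule (dma_offset = 8K),
 * followed by one full 16K granule, so every chunk after the head starts
 * on a granule boundary.
 */
struct granule_split {
	size_t head_len;	/* CPU-page aligned, < granule, may be 0 */
	size_t dma_offset;	/* offset of dma_handle into first granule */
	size_t nr_granules;	/* full granule-sized chunks after the head */
};

static void compute_granule_split(struct iova_domain *iovad, size_t size,
				  struct granule_split *split)
{
	size_t aligned = PAGE_ALIGN(size);
	size_t gran = iovad->granule;

	split->head_len = aligned % gran;
	split->dma_offset = split->head_len ? gran - split->head_len : 0;
	split->nr_granules = aligned / gran;
}

The head stays small (at most granule/PAGE_SIZE - 1 CPU pages), so nothing
should be over-allocated any more.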

> 
> >>>     */
> >>>    static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
> >>>    		size_t size, struct sg_table *sgt, gfp_t gfp, pgprot_t prot,
> >>> @@ -736,7 +748,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
> >>>    
> >>>    out_unmap:
> >>>    	__iommu_dma_unmap(dev, *dma_handle, size);
> >>> -	__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
> >>> +	__iommu_dma_free_pages(pages, iommu_page_align(dev, size) >> PAGE_SHIFT);
> >>>    	return NULL;
> >>>    }
> >>>    
> >>> @@ -766,7 +778,8 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
> >>>    	struct dma_sgt_handle *sh = sgt_handle(sgt);
> >>>    
> >>>    	__iommu_dma_unmap(dev, sgt->sgl->dma_address, size);
> >>> -	__iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
> >>> +	__iommu_dma_free_pages(sh->pages,
> >>> +		iommu_page_align(dev, size) >> PAGE_SHIFT);
> >>>    	sg_free_table(&sh->sgt);
> >>>    	kfree(sh);
> >>>    }
> >>> @@ -1006,6 +1019,31 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >>>    	if (dev_is_untrusted(dev))
> >>>    		return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
> >>>    
> >>> +	/*
> >>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we will
> >>> +	 * very likely run into sgs with a physical address that is not aligned
> >>> +	 * to an IOMMU page boundary. Fall back to just mapping every entry
> >>> +	 * independently with __iommu_dma_map then.
> >>
> >> Scatterlist segments often don't have nicely aligned ends, which is why
> >> we already align things to IOVA granules in main loop here. I think in
> >> principle we'd just need to move the non-IOVA-aligned part of the
> >> address from sg->page to sg->offset in the temporary transformation for
> >> the rest of the assumptions to hold. I don't blame you for being timid
> >> about touching that, though - it took me 3 tries to get right when I
> >> first wrote it...
> > 
> > I was a little bit afraid I'd get this exact reply :D
> > I'll try to modify the transformation again, but I'm sure it'll take me more than
> > 3 tries to get it right :)
> > 
> >>
> >>> +	 */
> >>> +	if (iovad->granule > PAGE_SIZE) {
> >>> +		for_each_sg(sg, s, nents, i) {
> >>> +			sg_dma_address(s) = __iommu_dma_map(dev, sg_phys(s),
> >>> +				s->length, prot, dma_get_mask(dev));
> >>> +			if (sg_dma_address(s) == DMA_MAPPING_ERROR)
> >>> +				break;
> >>> +			sg_dma_len(s) = s->length;
> >>> +		}
> >>> +
> >>> +		if (unlikely(i != nents)) {
> >>> +			nents = i;
> >>> +			for_each_sg(sg, s, nents, i)
> >>> +				__iommu_dma_unmap(dev, sg_dma_address(s), sg_dma_len(s));
> >>> +			return 0;
> >>> +		}
> >>> +
> >>> +		return nents;
> >>> +	}
> >>
> >> Either way, NAK to having a *third* implementation of SG mapping in this
> >> file which is fundamentally no different from the second one.
> > 
> > Good point. If for some reason I'm not able to modify the transformation correctly
> > I'll just fall back to iommu_dma_map_sg_swiotlb.
> > 
> >>
> >>> +
> >>>    	/*
> >>>    	 * Work out how much IOVA space we need, and align the segments to
> >>>    	 * IOVA granules for the IOMMU driver to handle. With some clever
> >>> @@ -1068,6 +1106,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >>>    static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> >>>    		int nents, enum dma_data_direction dir, unsigned long attrs)
> >>>    {
> >>> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> >>> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> >>> +	struct iova_domain *iovad = &cookie->iovad;
> >>>    	dma_addr_t start, end;
> >>>    	struct scatterlist *tmp;
> >>>    	int i;
> >>> @@ -1080,6 +1121,17 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> >>>    		return;
> >>>    	}
> >>>    
> >>> +	/*
> >>> +	 * If the IOMMU pagesize is larger than the CPU pagesize we mapped
> >>> +	 * every entry independently with __iommu_dma_map then. Let's do the
> >>> +	 * opposite here.
> >>> +	 */
> >>> +	if (iovad->granule > PAGE_SIZE) {
> >>> +		for_each_sg(sg, tmp, nents, i)
> >>> +			__iommu_dma_unmap(dev, sg_dma_address(tmp), sg_dma_len(tmp));
> >>> +		return;
> >>> +	}
> >>
> >> As above, this is just __iommu_dma_unmap_sg_swiotlb() with fewer clothes on.
> >>
> >>> +
> >>>    	/*
> >>>    	 * The scatterlist segments are mapped into a single
> >>>    	 * contiguous IOVA allocation, so this is incredibly easy.
> >>> @@ -1110,7 +1162,7 @@ static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
> >>>    
> >>>    static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
> >>>    {
> >>> -	size_t alloc_size = PAGE_ALIGN(size);
> >>> +	size_t alloc_size = iommu_page_align(dev, size);
> >>>    	int count = alloc_size >> PAGE_SHIFT;
> >>>    	struct page *page = NULL, **pages = NULL;
> >>>    
> >>> @@ -1150,7 +1202,7 @@ static void *iommu_dma_alloc_pages(struct device *dev, size_t size,
> >>>    		struct page **pagep, gfp_t gfp, unsigned long attrs)
> >>>    {
> >>>    	bool coherent = dev_is_dma_coherent(dev);
> >>> -	size_t alloc_size = PAGE_ALIGN(size);
> >>> +	size_t alloc_size = iommu_page_align(dev, size);
> >>>    	int node = dev_to_node(dev);
> >>>    	struct page *page = NULL;
> >>>    	void *cpu_addr;
> >>> @@ -1201,8 +1253,8 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
> >>>    
> >>>    	if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
> >>>    	    !gfpflags_allow_blocking(gfp) && !coherent)
> >>> -		page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
> >>> -					       gfp, NULL);
> >>> +		page = dma_alloc_from_pool(dev, iommu_page_align(dev, size),
> >>> +					       &cpu_addr, gfp, NULL);
> >>>    	else
> >>>    		cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
> >>>    	if (!cpu_addr)
> >>> @@ -1253,6 +1305,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> >>>    		void *cpu_addr, dma_addr_t dma_addr, size_t size,
> >>>    		unsigned long attrs)
> >>
> >> Can we just not bother trying to support this? TBH I don't know exactly
> >> how the interface is supposed to work - what you're doing here looks
> >> like it's probably either too much or not enough, depending on whether
> >> the address and size arguments are supposed to allow representing
> >> partial buffers - and frankly I can't imagine you'll be needing to
> >> support dma-buf exports from the USB/ethernet/wifi drivers any time soon...
> > 
> > I'm not really sure how this is going to work even before my changes.
> > Just from reading the code it looks like even then it might be doing too much
> > or too little.
> > But this will also be used for Thunderbolt and who knows what kind of devices
> > will be connected there. I'm fine with just not supporting this interface unless
> > something actually breaks for some user though.
> 
> I would bet that the set of random Thunderbolt-attachable devices which 
> participate in dma-buf exports and the set of media devices which expect 
> vb2_dma_contig to work for arbitrary user buffers have a non-empty 
> intersection. If you eat your cake you may subsequently discover that 
> you no longer have your cake ;)

:D

> 
> If we can't easily test it or reason about it, let's not pretend to 
> implement it. I'd rather somebody discovered the lack of working support 
> in a few years' time because it reliably and safely returned an error, 
> rather than because it ate their page cache. Besides, it's not like 
> people can't use a kernel built with the right PAGE_SIZE (which might 
> perform better anyway) and not have the problem in the first place.

Good points. I'll just add if (iovad->granule > PAGE_SIZE) return -ENXIO;
here, which should be equivalent to having no get_sgtable callback in dma_map_ops.
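
Concretely, something like this at the top of the function (untested, just to
show where the check would go):

static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
		void *cpu_addr, dma_addr_t dma_addr, size_t size,
		unsigned long attrs)
{
	struct iommu_domain *domain = iommu_get_dma_domain(dev);
	struct iommu_dma_cookie *cookie = domain->iova_cookie;
	struct iova_domain *iovad = &cookie->iovad;

	/* Not supported (yet) when IOMMU pages are larger than CPU pages */
	if (iovad->granule > PAGE_SIZE)
		return -ENXIO;

	/* ... rest of the existing function stays as-is ... */
}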

> 
> >>>    {
> >>> +	size_t aligned_size = iommu_page_align(dev, size);
> >>>    	struct page *page;
> >>>    	int ret;
> >>>    
> >>> @@ -1261,7 +1314,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> >>>    
> >>>    		if (pages) {
> >>>    			return sg_alloc_table_from_pages(sgt, pages,
> >>> -					PAGE_ALIGN(size) >> PAGE_SHIFT,
> >>> +					aligned_size >> PAGE_SHIFT,
> >>>    					0, size, GFP_KERNEL);
> >>>    		}
> >>>    
> >>> @@ -1272,7 +1325,7 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> >>>    
> >>>    	ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
> >>>    	if (!ret)
> >>> -		sg_set_page(sgt->sgl, page, PAGE_ALIGN(size), 0);
> >>> +		sg_set_page(sgt->sgl, page, aligned_size, 0);
> >>>    	return ret;
> >>>    }
> >>>    
> >>> @@ -1283,11 +1336,25 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
> >>>    	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
> >>>    }
> >>>    
> >>> +static struct page *iommu_dma_alloc_aligned_pages(struct device *dev, size_t size,
> >>> +		dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
> >>> +{
> >>> +	size = iommu_page_align(dev, size);
> >>> +	return dma_common_alloc_pages(dev, size, dma_handle, dir, gfp);
> >>> +}
> >>> +
> >>> +static void iommu_dma_free_aligned_pages(struct device *dev, size_t size, struct page *page,
> >>> +		dma_addr_t dma_handle, enum dma_data_direction dir)
> >>> +{
> >>> +	size = iommu_page_align(dev, size);
> >>> +	return dma_common_free_pages(dev, size, page, dma_handle, dir);
> >>> +}
> >>
> >> Again, what's the point of these? iommu_dma_map_page() still has to cope
> >> with whatever the caller provides, so there's no difference in the one
> >> case when that caller happens to be dma_common_map_pages().
> > 
> > Same as above, untrusted devices.
> 
> Again fair enough, but in that case do it for untrusted devices. Not for 
> the whole world for most of whom it still *is* a needless waste.

Yeah, agreed!
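
For v2 I'll restrict the extra alignment to untrusted devices then, roughly
like this (untested sketch of the helper from this patch):

static inline size_t iommu_page_align(struct device *dev, size_t size)
{
	struct iommu_domain *domain = iommu_get_dma_domain(dev);
	struct iommu_dma_cookie *cookie = domain->iova_cookie;

	/* only untrusted devices need the extra granule alignment */
	if (!dev_is_untrusted(dev))
		return PAGE_ALIGN(size);

	return iova_align(&cookie->iovad, PAGE_ALIGN(size));
}

That keeps the current behaviour (and memory usage) for everyone else.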

> 
> Robin.
> 


Best,

Sven
