Message-ID: <20240223143613.1878beb6@meshulam.tesarici.cz>
Date: Fri, 23 Feb 2024 14:36:13 +0100
From: Petr Tesařík <petr@...arici.cz>
To: Will Deacon <will@...nel.org>
Cc: Michael Kelley <mhklinux@...look.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "kernel-team@...roid.com"
<kernel-team@...roid.com>, "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
Christoph Hellwig <hch@....de>, Marek Szyprowski
<m.szyprowski@...sung.com>, Robin Murphy <robin.murphy@....com>, Petr
Tesarik <petr.tesarik1@...wei-partners.com>, Dexuan Cui
<decui@...rosoft.com>, Nicolin Chen <nicolinc@...dia.com>
Subject: Re: [PATCH v4 1/5] swiotlb: Fix double-allocation of slots due to
broken alignment handling
On Fri, 23 Feb 2024 12:47:43 +0000
Will Deacon <will@...nel.org> wrote:
> On Wed, Feb 21, 2024 at 11:35:44PM +0000, Michael Kelley wrote:
> > From: Will Deacon <will@...nel.org> Sent: Wednesday, February 21, 2024 3:35 AM
> > >
> > > Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"),
> > > which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment
> > > checks"), causes a functional regression with vsock in a virtual machine
> > > using bouncing via a restricted DMA SWIOTLB pool.
> > >
> > > When virtio allocates the virtqueues for the vsock device using
> > > dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
> > > allocations if 'area->index' was left unaligned by a previous allocation
> > > from the buffer:
> > >
> > > # Final address in brackets is the SWIOTLB address returned to the caller
> > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
> > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
> > > | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)
> > >
> > > This ends badly (typically buffer corruption and/or a hang) because
> > > swiotlb_alloc() is expecting a page-aligned allocation and so blindly
> > > returns a pointer to the 'struct page' corresponding to the allocation,
> > > therefore double-allocating the first half (2KiB slot) of the 4KiB page.
> > >
> > > Fix the problem by treating the allocation alignment separately to any
> > > additional alignment requirements from the device, using the maximum
> > > of the two as the stride to search the buffer slots and taking care
> > > to ensure a minimum of page-alignment for buffers larger than a page.
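[Editorial aside, not part of the patch: a minimal sketch of why a misaligned slot corrupts memory, assuming the simplified shape of swiotlb_alloc() in kernel/dma/swiotlb.c.]

    /* swiotlb_alloc() trusts the search to return a page-aligned slot: */
    tlb_addr = slot_addr(pool->start, index);    /* e.g. 0x98326800     */
    return pfn_to_page(PFN_DOWN(tlb_addr));      /* page at 0x98326000  */
    /*
     * PFN_DOWN() silently discards the low 2KiB: the returned page
     * spans 0x98326000-0x98326fff, whose first slot may still belong
     * to an earlier allocation -- the double-allocation described
     * above.
     */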
> >
> > Could you also add some text noting that this patch fixes the scenario I
> > described in the other email thread? Something like:
> >
> > The changes to page alignment handling also fix a problem when
> > the alloc_align_mask is zero. The page alignment handling added
> > in the two mentioned commits could force alignment to more bits
> > in orig_addr than specified by the device's DMA min_align_mask,
> > resulting in a larger offset. Since swiotlb_max_mapping_size()
> > is based only on the DMA min_align_mask, that larger offset
> > plus the requested size could exceed IO_TLB_SEGSIZE slots, and
> > the mapping could fail when it shouldn't.
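[Editorial aside: a hypothetical worked example of that failure, assuming 4KiB pages, 2KiB slots, IO_TLB_SEGSIZE = 128, and that an allocation of IO_TLB_SEGSIZE slots can only start on a segment boundary.]

    min_align_mask = 0  ->  swiotlb_max_mapping_size() = 128 * 2KiB = 256KiB
    caller maps 256KiB from an orig_addr with bit 11 set (e.g. ...0x800);
    the old code ORs ~PAGE_MASK into iotlb_align_mask, so every candidate
    slot must also have bit 11 set -- but segment boundaries have bit 11
    clear, so no slot ever qualifies and the mapping fails even though
    the caller honoured swiotlb_max_mapping_size().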
>
> Thanks, Michael. I can add that in.
>
> > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > > index b079a9a8e087..2ec2cc81f1a2 100644
> > > --- a/kernel/dma/swiotlb.c
> > > +++ b/kernel/dma/swiotlb.c
> > > @@ -982,7 +982,7 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > phys_to_dma_unencrypted(dev, pool->start) & boundary_mask;
> > > unsigned long max_slots = get_max_slots(boundary_mask);
> > > unsigned int iotlb_align_mask =
> > > - dma_get_min_align_mask(dev) | alloc_align_mask;
> > > + dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> > > unsigned int nslots = nr_slots(alloc_size), stride;
> > > unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> > > unsigned int index, slots_checked, count = 0, i;
> > > @@ -993,19 +993,18 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > BUG_ON(!nslots);
> > > BUG_ON(area_index >= pool->nareas);
> > >
> > > + /*
> > > + * For mappings with an alignment requirement don't bother looping to
> > > + * unaligned slots once we found an aligned one.
> > > + */
> > > + stride = get_max_slots(max(alloc_align_mask, iotlb_align_mask));
> > > +
> > > /*
> > > * For allocations of PAGE_SIZE or larger only look for page aligned
> > > * allocations.
> > > */
> > > if (alloc_size >= PAGE_SIZE)
> > > - iotlb_align_mask |= ~PAGE_MASK;
> > > - iotlb_align_mask &= ~(IO_TLB_SIZE - 1);
> > > -
> > > - /*
> > > - * For mappings with an alignment requirement don't bother looping to
> > > - * unaligned slots once we found an aligned one.
> > > - */
> > > - stride = (iotlb_align_mask >> IO_TLB_SHIFT) + 1;
> > > + stride = umax(stride, PAGE_SHIFT - IO_TLB_SHIFT + 1);
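[Editorial aside: the resulting stride values, assuming 2KiB slots (IO_TLB_SHIFT = 11), 4KiB pages, and get_max_slots(mask) = nr_slots(mask + 1) as in kernel/dma/swiotlb.c.]

    alloc_align_mask = 0xFFF -> stride = nr_slots(0x1000) = 2
                                (probe every second slot, i.e. each
                                4KiB boundary)
    no alignment requested   -> stride = nr_slots(1) = 1
                                (probe every slot)
    alloc_size >= PAGE_SIZE  -> stride = umax(stride, 12 - 11 + 1) = 2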
> >
> > Is this special handling of alloc_size >= PAGE_SIZE really needed?
>
> I've been wondering that as well, but please note that this code (and the
> comment) are in the upstream code, so I was erring in favour of keeping
> that while fixing the bugs. We could have an extra patch dropping it if
> we can convince ourselves that it's not adding anything, though.
>
> > I think the comment is somewhat inaccurate. If orig_addr is non-zero, and
> > alloc_align_mask is zero, the requirement is for the alignment to match
> > the DMA min_align_mask bits in orig_addr, even if the allocation is
> > larger than a page. And with Patch 3 of this series, the swiotlb_alloc()
> > case passes in alloc_align_mask to handle page size and larger requests.
> > So it seems like this doesn't do anything useful unless orig_addr and
> > alloc_align_mask are both zero, and there aren't any cases of that
> > after this patch series. If the caller wants alignment, it should
> > specify it with alloc_align_mask.
>
> It's an interesting observation. Presumably the intention here is to
> reduce the cost of the linear search, but the code originates from a
> time when we didn't have iotlb_align_mask or alloc_align_mask and so I
> tend to agree that it should probably just be dropped. I'm also not even
> convinced that it works properly if the initial search index ends up
> being 2KiB (i.e. slot) aligned -- we'll end up jumping over the
> page-aligned addresses!
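[Editorial aside: a hypothetical trace of that failure mode, assuming PAGE_SIZE = 4KiB, 2KiB slots, and an allocation with orig_addr == 0 and alloc_align_mask == 0.]

    stride = 2;           /* bumped because alloc_size >= PAGE_SIZE     */
    index  = area->index; /* suppose a previous map left this at 1     */
    /*
     * Nothing rejects unaligned candidates (the iotlb_align_mask
     * check is skipped when orig_addr == 0), so the loop probes slots
     * 1, 3, 5, ... -- the stride keeps the index odd, every even
     * (page-aligned) slot is jumped over, and the first free odd run
     * is returned 2KiB-misaligned.
     */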
Originally, SWIOTLB was not used for allocations, so orig_addr was
never zero. The assumption was that if the bounce buffer should be
page-aligned, then the original buffer was also page-aligned, and the
check against iotlb_align_mask was sufficient.
> I'll add another patch to v5 which removes this check (and you've basically
> written the commit message for me, so thanks).
>
> > > spin_lock_irqsave(&area->lock, flags);
> > > if (unlikely(nslots > pool->area_nslabs - area->used))
> > > @@ -1015,11 +1014,14 @@ static int swiotlb_search_pool_area(struct device *dev, struct io_tlb_pool *pool
> > > index = area->index;
> > >
> > > for (slots_checked = 0; slots_checked < pool->area_nslabs; ) {
> > > - slot_index = slot_base + index;
> > > + phys_addr_t tlb_addr;
> > >
> > > - if (orig_addr &&
> > > - (slot_addr(tbl_dma_addr, slot_index) &
> > > - iotlb_align_mask) != (orig_addr & iotlb_align_mask)) {
> > > + slot_index = slot_base + index;
> > > + tlb_addr = slot_addr(tbl_dma_addr, slot_index);
> > > +
> > > + if ((tlb_addr & alloc_align_mask) ||
> > > + (orig_addr && (tlb_addr & iotlb_align_mask) !=
> > > + (orig_addr & iotlb_align_mask))) {
> >
> > It looks like these changes will cause a mapping failure in some
> > iommu_dma_map_page() cases that previously didn't fail.
>
> Hmm, it's really hard to tell. This code has been quite badly broken for
> some time, so I'm not sure how far back you have to go to find a kernel
> that would work properly (e.g. for Nicolin's case with 64KiB pages).
I believe it fails in exactly those cases where the search previously
returned an incorrectly aligned bounce buffer.
In any case, the "middle" bits of tlb_addr (the low bits, excluding
the offset within a slot) should indeed correspond to the same bits of
orig_addr.
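[Editorial aside: a sketch of that "middle bits" check as it appears in the hunks above, assuming IO_TLB_SIZE = 2KiB and a device min_align_mask of 0xFFF.]

    iotlb_align_mask = 0xFFF & ~(2048 - 1);    /* = 0x800: bit 11 only */
    /*
     * The search then demands
     *     (tlb_addr & 0x800) == (orig_addr & 0x800),
     * i.e. the bits above the in-slot offset but within min_align_mask
     * must match; bits 0-10 are reproduced separately via the offset
     * from swiotlb_align_offset().
     */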
>
> > Everything is made right by Patch 4 of your series, but from a
> > bisect standpoint, there will be a gap where things are worse.
> > In [1], I think Nicolin reported a crash with just this patch applied.
>
> In Nicolin's case, I think it didn't work without the patch either;
> this patch just triggered the failure earlier.
>
> > While the iommu_dma_map_page() case can already fail due to
> > "too large" requests because of not setting a max mapping size,
> > this patch can cause smaller requests to fail as well until Patch 4
> > gets applied. That might be a problem worth avoiding, perhaps by
> > merging the Patch 4 changes into this patch.
>
> I'll leave this up to Christoph. Personally, I'm keen to avoid having
> a giant patch trying to fix all the SWIOTLB allocation issues in one go,
> as it will inevitably get reverted due to a corner case that we weren't
> able to test properly, breaking the common cases at the same time.
I tend to think that more, smaller patches are better, even though this
patch alone does introduce some regressions.
Petr T