Message-ID:
 <SN6PR02MB41571DA1EE99BFAA65869024D4232@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Mon, 4 Mar 2024 03:31:34 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Petr Tesařík <petr@...arici.cz>, Robin Murphy
	<robin.murphy@....com>
CC: Christoph Hellwig <hch@....de>, Will Deacon <will@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Petr Tesarik
	<petr.tesarik1@...wei-partners.com>, "kernel-team@...roid.com"
	<kernel-team@...roid.com>, "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
	Marek Szyprowski <m.szyprowski@...sung.com>, Dexuan Cui
	<decui@...rosoft.com>, Nicolin Chen <nicolinc@...dia.com>
Subject: RE: [PATCH v5 6/6] swiotlb: Remove pointless stride adjustment for
 allocations >= PAGE_SIZE

From: Petr Tesařík <petr@...arici.cz> Sent: Friday, March 1, 2024 10:42 AM
> 
> On Fri, 1 Mar 2024 17:54:06 +0000
> Robin Murphy <robin.murphy@....com> wrote:
> 
> > On 2024-03-01 5:08 pm, Petr Tesařík wrote:
> > > On Fri, 1 Mar 2024 16:39:27 +0100
> > > Petr Tesařík <petr@...arici.cz> wrote:
> > >
> > >> On Thu, 29 Feb 2024 16:47:56 +0100
> > >> Christoph Hellwig <hch@....de> wrote:
> > >>
> > >>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote:
> > >>>> Any thoughts on how that historical behavior should apply if
> > >>>> the DMA min_align_mask is non-zero, or the alloc_align_mask
> > >>>> parameter to swiotlb_tbl_map_single() is non-zero? As currently
> > >>>> used, alloc_align_mask is page aligned if the IOMMU granule is
> > >>>> >= PAGE_SIZE. But a non-zero min_align_mask could mandate
> > >>>> returning a buffer that is not page aligned. Perhaps do the
> > >>>> historical behavior only if alloc_align_mask and min_align_mask
> > >>>> are both zero?
> > >>>
> > >>> I think the driver setting min_align_mask is a clear indicator
> > >>> that the driver requested a specific alignment and the defaults
> > >>> don't apply.  For swiotlb_tbl_map_single() as used by dma-iommu
> > >>> I'd have to take a closer look at how it is used.
> > >>
> > >> I'm not sure it helps in this discussion, but let me dive into a bit
> > >> of ancient history to understand how we ended up here.
> > >>
> > >> IIRC this behaviour was originally motivated by limitations of PC AT
> > >> hardware. The Intel 8237 DMA controller has 16-bit address
> > >> registers. To make it
> > >> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM
> > >> added a page register, but it was on a separate chip and it did not
> > >> increment when the 8237 address rolled over back to zero. Effectively,
> > >> the page register selected a 64K-aligned, 64K-sized window.
> > >> Consequently, DMA buffers could not cross a 64K physical boundary.
> > >>
> > >> Thanks to how the buddy allocator works, the 64K-boundary constraint
> > >> was satisfied by allocation size, and drivers took advantage of it when
> > >> allocating device buffers. IMO software bounce buffers simply followed
> > >> the same logic that worked for buffers allocated by the buddy allocator.
> > >>
> > >> OTOH min_align_mask was motivated by NVMe, which prescribes the value of
> > >> a certain number of low bits in the DMA address (for simplicity assumed
> > >> to be identical with the same bits in the physical address).
> > >>
> > >> The only pre-existing user of alloc_align_mask is the x86 IOMMU code, and
> > >> IIUC it is used to guarantee that unaligned transactions do not share
> > >> the IOMMU granule with another device. This whole thing is weird,
> > >> because swiotlb_tbl_map_single() is called like this:
> > >>
> > >>                  aligned_size = iova_align(iovad, size);
> > >>                  phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
> > >>                                                iova_mask(iovad), dir, attrs);
> > >>
> > >> Here:
> > >>
> > >> * alloc_size = iova_align(iovad, size)
> > >> * alloc_align_mask = iova_mask(iovad)
> > >>
> > >> Now, iova_align() rounds up its argument to a multiple of the IOVA
> > >> granule, and iova_mask() is simply "granule - 1". This works
> > >> because the granule size must be a power of 2, and I assume it must
> > >> also be >= PAGE_SIZE.
> > >>
> > >> In that case, the alloc_align_mask argument is not even needed if you
> > >> adjust the code to match the documentation: the resulting buffer will be
> > >> aligned to a granule boundary by virtue of having a size that is a
> > >> multiple of the granule size.
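(For reference, the two helpers discussed above reduce to roughly the
following; this is a paraphrase of include/linux/iova.h, not verbatim
kernel code:)

        /* iova_mask(): the granule is a power of 2, so granule - 1
         * is the mask of the in-granule offset bits. */
        static inline unsigned long iova_mask(struct iova_domain *iovad)
        {
                return iovad->granule - 1;
        }

        /* iova_align(): round size up to a granule multiple. */
        static inline size_t iova_align(struct iova_domain *iovad, size_t size)
        {
                return ALIGN(size, iovad->granule);
        }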
> > >>
> > >> To sum it up:
> > >>
> > >> 1. min_align_mask is by far the most important constraint. Devices will
> > >>     simply stop working if it is not met.
> > >> 2. Alignment to the smallest PAGE_SIZE order which is greater than or
> > >>     equal to the requested size has been documented, and some drivers
> > >>     may rely on it.
> > >> 3. alloc_align_mask is a misguided fix for a bug in the above.
> > >>
> > >> Correct me if anything of the above is wrong.
> > >
> > > I thought about it some more, and I believe I know what should happen
> > > if the first two constraints appear to be mutually exclusive.
> > >
> > > First, the alignment based on size does not guarantee that the resulting
> > > physical address is aligned. In fact, the lowest IO_TLB_SHIFT bits must
> > > always be identical to those of the original buffer address.
> > >
> > > Let's take an example request like this:
> > >
> > >     TLB_SIZE       = 0x00000800
> > >     min_align_mask = 0x0000ffff
> > >     orig_addr      = 0x....1234
> > >     alloc_size     = 0x00002800
> > >
> > > The minimum alignment mask requires keeping the 0x1234 at the end. The
> > > allocation size requires a buffer that is aligned to 16K. Of course,
> > > there is no 16K-aligned slot with slot_address & 0x7ff == 0x200, but if
> > > IO_TLB_SHIFT were 14, it would be slot_address & 0x3fff == 0 (the low
> > > IO_TLB_SHIFT bits are masked off). Since the SWIOTLB API does not
> > > guarantee any specific value of IO_TLB_SHIFT, callers cannot rely on
> > > it. That means 0x1234 is a perfectly valid bounce buffer address for
> > > this example.
> > >
> > > The caller may rightfully expect that the 16K granule containing the
> > > bounce buffer is not shared with any other user. For the above case I
> > > suggest increasing the allocation size to 0x4000 already in
> > > swiotlb_tbl_map_single() and treat 0x1234 as the offset from the slot
> > > address.
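(A quick userspace sketch of the conflict in this example; the
constants come from the example above and the code is purely
illustrative:)

        #include <assert.h>
        #include <stdint.h>

        int main(void)
        {
                uint64_t min_align_mask = 0xffff;
                uint64_t orig_low   = 0x1234; /* orig_addr & min_align_mask */
                uint64_t size_align = 0x4000; /* 16K, next power of 2 >= 0x2800 */

                /* Every 16K-aligned address has its low 14 bits clear, so
                 * its low 16 bits are 0x0000/0x4000/0x8000/0xc000 and can
                 * never equal 0x1234. */
                for (uint64_t addr = 0; addr < 0x40000; addr += size_align)
                        assert((addr & min_align_mask) != orig_low);
                return 0;
        }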
> >
> > That doesn't make sense - a caller asks to map some range of kernel
> > addresses and they get back a corresponding range of DMA addresses; they
> > cannot make any reasonable assumptions about DMA addresses *outside*
> > that range. In the example above, the caller has explicitly chosen not
> > to map the range xxx0000-xxx1234; if they expect the device to actually
> > access bytes in the DMA range yyy0000-yyy1234, then they should have
> > mapped the whole range starting from xxx0000 and it is their own error.
> 
> I agree that the range was not requested. But it is not wrong if
> SWIOTLB overallocates. In fact, it usually does overallocate because it
> works with slot granularity.
> 
> > SWIOTLB does not and cannot provide any memory protection itself, so
> > there is no functional benefit to automatically over-allocating, all it
> > will do is waste slots. iommu-dma *can* provide memory protection
> > between individual mappings via additional layers that SWIOTLB doesn't
> > know about, so in that case it's iommu-dma's responsibility to
> > explicitly manage whatever over-allocation is necessary at the SWIOTLB
> > level to match the IOMMU level.
> 
> I'm trying to understand what the caller expects to get if they request
> buffer alignment (given either implicitly through the mapping size or
> explicitly with an alloc_align_mask) together with a min_align_mask, and
> the original address has non-zero low bits covered by the buffer
> alignment.
> 
> In other words, if iommu_dma_map_page() gets into this situation:
> 
> * granule size is 4k
> * device specifies 64k min_align_mask
> * bit 11 of the original buffer address is non-zero
> 
> Then you ask for a pair of slots where the first slot has bit 11 == 0
> (required by alignment to granule size) and also has bit 11 == 1
> (required to preserve the lowest 16 bits of the original address).
> 
> Sure, you can fail such a mapping, but is it what the caller expects?
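(The same disjointness with these numbers: a granule-aligned address
has its low 12 bits clear, hence bit 11 == 0, while preserving the
original low 16 bits requires bit 11 == 1. Illustrative check only:)

        #include <assert.h>
        #include <stdint.h>

        int main(void)
        {
                uint64_t granule_mask   = 0x0fff; /* 4K IOVA granule */
                uint64_t min_align_mask = 0xffff; /* 64K device constraint */
                uint64_t orig_addr      = 0x0800; /* bit 11 set */

                /* No address is both granule-aligned and preserves the
                 * min_align_mask-covered bits of orig_addr. */
                for (uint64_t addr = 0; addr <= 0xffff; addr++)
                        assert(!((addr & granule_mask) == 0 &&
                                 (addr & min_align_mask) ==
                                         (orig_addr & min_align_mask)));
                return 0;
        }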
> 

Here's my take on tying all the threads together. There are 
four alignment combinations:

1. alloc_align_mask: zero; min_align_mask: zero
2. alloc_align_mask: zero; min_align_mask: non-zero
3. alloc_align_mask: non-zero; min_align_mask: zero/ignored
4. alloc_align_mask: non-zero; min_align_mask: non-zero

xen_swiotlb_map_page() and dma_direct_map_page() are #1 or #2
via swiotlb_map() and swiotlb_tbl_map_single()

iommu_dma_map_page() is #3 and #4 via swiotlb_tbl_map_single()

swiotlb_alloc() is #3, directly to swiotlb_find_slots()
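
Roughly how those call sites map onto the parameters (paraphrased from
memory of the current sources, so treat the argument lists as
approximate rather than exact):

        /* #1/#2: swiotlb_map() requests no allocation alignment */
        swiotlb_tbl_map_single(dev, paddr, size, size, 0, dir, attrs);

        /* #3/#4: iommu_dma_map_page() passes granule-based size and mask */
        aligned_size = iova_align(iovad, size);
        phys = swiotlb_tbl_map_single(dev, phys, size, aligned_size,
                                      iova_mask(iovad), dir, attrs);

        /* #3: swiotlb_alloc() goes straight to swiotlb_find_slots() */
        index = swiotlb_find_slots(dev, 0, size, PAGE_SIZE - 1, &pool);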

For #1, the returned physical address has no constraints if 
the requested size is less than a page. For page size or 
greater, the discussed historical requirement for page 
alignment applies.

For #2, min_align_mask governs the bits of the returned 
physical address that must match the original address. When 
needed, swiotlb must also allocate pre-padding aligned to 
IO_TLB_SIZE that precedes the returned physical address.  A 
request size <= swiotlb_max_mapping_size() will not exceed 
IO_TLB_SEGSIZE even with the padding. The historical 
requirement for page alignment does not apply because the 
driver has explicitly used the newer min_align_mask feature.
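
A minimal sketch of that pre-padding arithmetic (my formulation, not
code from the patch set):

        /* Preserve the min_align_mask-covered low bits of the original
         * address; any full slots below that offset become pre-padding
         * owned by this mapping. */
        unsigned int offset    = orig_addr & min_align_mask;
        unsigned int pad_slots = offset >> IO_TLB_SHIFT;

        /* The mapping then claims pad_slots extra slots in front of the
         * address returned to the caller, so a request size bounded by
         * swiotlb_max_mapping_size() still fits in IO_TLB_SEGSIZE slots. */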

For #3, alloc_align_mask specifies the required alignment. No 
pre-padding is needed. Per earlier comments from Robin[1], 
it's reasonable to assume alloc_align_mask (i.e., the granule) 
is >= IO_TLB_SIZE. The original address is not relevant in 
determining the alignment, and the historical page alignment 
requirement does not apply since alloc_align_mask explicitly 
states the alignment.

For #4, the returned physical address must match the bits
in the original address specified by min_align_mask.  swiotlb
must also allocate pre-padding aligned to
alloc_align_mask that precedes the returned physical address.
Also per Robin[1], assume alloc_align_mask is >=
min_align_mask, which solves the conflicting alignment 
problem pointed out by Petr[2]. Perhaps we should add a 
"WARN_ON(alloc_align_mask < min_align_mask)" rather than 
failing depending on which bits of the original address are 
set. Again, the historical requirement for page alignment does 
not apply.
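
Putting #1 through #4 together, the selection logic I have in mind
looks roughly like this (hypothetical helper and naming, mine rather
than anything from Will's patches):

        static unsigned int swiotlb_effective_align(size_t alloc_size,
                        unsigned int alloc_align_mask,
                        unsigned int min_align_mask)
        {
                /* #4: the suggested sanity check */
                WARN_ON(alloc_align_mask && alloc_align_mask < min_align_mask);

                if (alloc_align_mask)        /* #3 and #4 */
                        return alloc_align_mask;
                if (min_align_mask)          /* #2: slot alignment only */
                        return IO_TLB_SIZE - 1;
                if (alloc_size >= PAGE_SIZE) /* #1: historical page alignment */
                        return PAGE_SIZE - 1;
                return 0;
        }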

I believe Will's patch set implements everything in #2, #3, 
and #4, except my suggested WARN_ON in #4. The historical page 
alignment in #1 presumably needs to be added. Also, the current 
implementation of #4 has a bug in that IO_TLB_SEGSIZE could be 
exceeded as pointed out here[3], but Robin was OK with not 
fixing that now.

Michael

[1] https://lore.kernel.org/linux-iommu/20240221113504.7161-1-will@kernel.org/T/#mbd31cbfbdf841336e25f37758c8af1a0b6d8f3eb 
[2] https://lore.kernel.org/linux-iommu/20240228133930.15400-1-will@kernel.org/T/#mf631679b302b1f5c7cacc82f4c15fb4b19f3dea1 
[3] https://lore.kernel.org/linux-iommu/20240221113504.7161-1-will@kernel.org/T/#m4179a909777ec751f3dc15b515617477e6682600
