Message-ID: <20231103195949.0af884d0@meshulam.tesarici.cz>
Date: Fri, 3 Nov 2023 19:59:49 +0100
From: Petr Tesařík <petr@...arici.cz>
To: Niklas Schnelle <schnelle@...ux.ibm.com>
Cc: Christoph Hellwig <hch@....de>,
Bjorn Helgaas <bhelgaas@...gle.com>,
Marek Szyprowski <m.szyprowski@...sung.com>,
Robin Murphy <robin.murphy@....com>,
Petr Tesarik <petr.tesarik1@...wei-partners.com>,
Ross Lagerwall <ross.lagerwall@...rix.com>,
linux-pci <linux-pci@...r.kernel.org>,
linux-kernel@...r.kernel.org, iommu@...ts.linux.dev,
Matthew Rosato <mjrosato@...ux.ibm.com>,
Halil Pasic <pasic@...ux.ibm.com>
Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y
Hello Niklas,
thank you for your report. This is some great analysis!
On Fri, 03 Nov 2023 16:13:03 +0100
Niklas Schnelle <schnelle@...ux.ibm.com> wrote:
> Hi Swiotlb Maintainers,
>
> With s390 slated to use dma-iommu.c I was experimenting with
> CONFIG_SWIOTLB_DYNAMIC but was getting errors all over the place.
> Debugging this I've managed to narrow things down to what I believe is
> a memory corruption issue caused by overrunning the entire transient
> memory pool allocated by swiotlb. In my test this happens on the
> iommu_dma_map_page(), swiotlb_tbl_map_single() path when handling
> untrusted PCI devices.
>
> I've seen this happen only for transient pools when:
> * allocation size >=PAGE_SIZE and,
> * the original address of the mapping is not page aligned.
>
> The overrun can be seen clearly by adding a simple "tlb_addr +
> alloc_size > pool->end" overrun check to swiotlb_tbl_map_single() and
> forcing PCI test devices to be untrusted. With that in place, an NVMe
> fio workload runs fine, but TCP/IP tests on a Mellanox ConnectX-4 VF
> reliably trigger the overrun check and, with a debug print, produce
> output like the following:
>
> software IO TLB:
> transient:1
> index:1
> dma_get_min_align_mask(dev):0
> orig_addr:ac0caebe
> tlb_addr=a4d0f800
> start:a4d0f000
> end:a4d10000
> pool_size:4096
> alloc_size:4096
> offset:0
>
> The complete code used for this test is available on my
> git.kernel.org[0] repository but it's basically v6.6 + iommu/next (for
> s390 DMA API) + 2 trivial debug commits.
>
> For further analysis I've worked closely with Halil Pasic.
>
> The reason why we think this is happening seems twofold. Under a
> certain set of circumstances in the function swiotlb_find_slots():
> 1) the allocated transient pool cannot fit the required bounce buffer,
I am aware that this can happen. It's in fact why the index returned by
swiotlb_search_pool_area() is checked in swiotlb_find_slots().
> and
> 2) the invocation of swiotlb_pool_find_slots() finds "a suitable
> slot" even though it should fail.
This needs fixing.
> The reason for 2), i.e. swiotlb_pool_find_slots() thinking there is a
> suitable bounce buffer in the pool, is that for transient pools
> pool->slots[i].list ends up nonsensical, because
> swiotlb_init_io_tlb_pool(), among other places in swiotlb, assumes
> that the nslabs of the pool is a multiple of IO_TLB_SEGSIZE.
Ah... Yes, the transient pool size must also be a multiple of segment
size, but it is not enforced.
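To illustrate the point about segment size, here is a minimal userspace
sketch, not kernel code. It assumes the per-slot free-run counter is
initialized as IO_TLB_SEGSIZE minus the slot's offset within its segment,
which is only meaningful when nslabs is a multiple of IO_TLB_SEGSIZE. The
helper names assumed_free_run(), clamped_free_run() and
round_up_to_segment() are hypothetical and exist only for this example.

```c
#include <assert.h>

#define IO_TLB_SEGSIZE 128u	/* slots per segment */

/*
 * Hypothetical helper: free-run length for slot i under the assumption
 * that the pool always contains whole segments.
 */
static unsigned int assumed_free_run(unsigned int i)
{
	return IO_TLB_SEGSIZE - (i % IO_TLB_SEGSIZE);
}

/*
 * Hypothetical helper: the same value, but never larger than the number
 * of slots that actually remain in the pool.
 */
static unsigned int clamped_free_run(unsigned int i, unsigned int nslabs)
{
	unsigned int run = assumed_free_run(i);
	unsigned int left;

	assert(i < nslabs);	/* slot index must lie inside the pool */
	left = nslabs - i;
	return run < left ? run : left;
}

/* Hypothetical helper: round a pool size up to whole segments. */
static unsigned int round_up_to_segment(unsigned int nslabs)
{
	return (nslabs + IO_TLB_SEGSIZE - 1) / IO_TLB_SEGSIZE * IO_TLB_SEGSIZE;
}
```

For the 4096-byte transient pool from the report (two 2048-byte slots),
assumed_free_run(1) claims 127 contiguous free slots, while
clamped_free_run(1, 2) gives 1. Either clamping the counter or rounding
the pool up with round_up_to_segment(2), which gives 128, would keep the
counters consistent with the real pool size. Which of the two is the
better fix is not something this sketch decides.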
This reminds me of my debugging session that resulted in commit
aabd12609f91 ("swiotlb: always set the number of areas before
allocating the pool") and this conversation:
https://lore.kernel.org/linux-iommu/20230629074403.7f8ed9d6@meshulam.tesarici.cz/
> The reason for 1) is a bit more convoluted and not entirely understood
> by us. We are certain though that the function swiotlb_find_slots()
> allocates a pool with nr_slots(alloc_size), where this alloc_size is
> the alloc_size from swiotlb_tbl_map_single() + swiotlb_align_offset(),
> but for alignment reasons some slots may get "skipped over" in
> swiotlb_area_find_slots() causing the picked slots to overrun the
> allocation.
Yes, this can happen.
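The arithmetic behind the reported overrun can be reproduced with the
numbers from the debug output. This is a rough userspace sketch under the
assumption of 2048-byte slots (IO_TLB_SHIFT of 11). The helpers
slots_needed(), slot_addr() and overruns_pool() are hypothetical names
for this example; overruns_pool() mirrors the "tlb_addr + alloc_size >
pool->end" check quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IO_TLB_SHIFT 11u
#define IO_TLB_SIZE (1u << IO_TLB_SHIFT)	/* 2048 bytes per slot */

/* Hypothetical helper: number of slots covering alloc_size bytes. */
static unsigned int slots_needed(uint64_t alloc_size)
{
	return (unsigned int)((alloc_size + IO_TLB_SIZE - 1) >> IO_TLB_SHIFT);
}

/* Hypothetical helper: address of slot `index` within a pool. */
static uint64_t slot_addr(uint64_t pool_start, unsigned int index)
{
	return pool_start + ((uint64_t)index << IO_TLB_SHIFT);
}

/* Hypothetical helper: does the bounce buffer extend past the pool end? */
static bool overruns_pool(uint64_t tlb_addr, uint64_t alloc_size,
			  uint64_t pool_end)
{
	assert(tlb_addr <= pool_end);	/* buffer must start inside the pool */
	return tlb_addr + alloc_size > pool_end;
}
```

With start 0xa4d0f000 and end 0xa4d10000, the pool holds exactly
slots_needed(4096) == 2 slots. A buffer placed at index 0 fits, but once
the search skips to index 1 (tlb_addr 0xa4d0f800), the same 4096-byte
allocation ends at 0xa4d10800 and overruns_pool() returns true.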
> Not sure how to properly fix this as the different alignment
> requirements get pretty complex quickly. So would appreciate your
> input.
I don't think it's possible to improve the allocation logic without
modifying the page allocator and/or the DMA atomic pool allocator to
take additional constraints into account.
I had a wild idea back in March, but it would require some intrusive
changes in the mm subsystem. Among other things, it would make memory
zones obsolete. I mean, people may actually like to get rid of DMA,
DMA32 and NORMAL, but you see how many nasty bugs were introduced even
by a relatively small change in SWIOTLB. Replacing memory zones with a
system based on generic physical allocation constraints would probably
blow up the universe. ;-)
Petr T