linux-kernel - Re: Memory corruption with CONFIG_SWIOTLB

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20231103195949.0af884d0@meshulam.tesarici.cz>
Date:   Fri, 3 Nov 2023 19:59:49 +0100
From:   Petr Tesařík <petr@...arici.cz>
To:     Niklas Schnelle <schnelle@...ux.ibm.com>
Cc:     Christoph Hellwig <hch@....de>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        Marek Szyprowski <m.szyprowski@...sung.com>,
        Robin Murphy <robin.murphy@....com>,
        Petr Tesarik <petr.tesarik1@...wei-partners.com>,
        Ross Lagerwall <ross.lagerwall@...rix.com>,
        linux-pci <linux-pci@...r.kernel.org>,
        linux-kernel@...r.kernel.org, iommu@...ts.linux.dev,
        Matthew Rosato <mjrosato@...ux.ibm.com>,
        Halil Pasic <pasic@...ux.ibm.com>
Subject: Re: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

Hello Niklas,

thank you for your report. This is some great analysis!

On Fri, 03 Nov 2023 16:13:03 +0100
Niklas Schnelle <schnelle@...ux.ibm.com> wrote:

> Hi Swiotlb Maintainers,
> 
> With s390 slated to use dma-iommu.c I was experimenting with
> CONFIG_SWIOTLB_DYNAMIC but was getting errors all over the place.
> Debugging this I've managed to narrow things down to what I believe is
> a memory corruption issue caused by overrunning the entire transient
> memory pool allocated by swiotlb. In my test this happens on the
> iommu_dma_map_page(), swiotlb_tbl_map_single() path when handling
> untrusted PCI devices.
> 
> I've seen this happen only for transient pools when:
> *  allocation size >=PAGE_SIZE and,
> *  the original address of the mapping is not page aligned. 
> 
> The overrun can be seen clearly by adding a simple "tlb_addr +
> alloc_size > pool->end' overrun check to swiotlb_tbl_map_single() and
> forcing PCI test devices to be untrusted. With that in place while an
> NVMe fio work load runs fine TCP/IP tests on a Mellanox ConnectX-4 VF
> reliably trigger the overrun check and with a debug print produce
> output like the following:
> 
> software IO TLB:
> 	transient:1
> 	index:1
> 	dma_get_min_align_mask(dev):0
> 	orig_addr:ac0caebe
> 	tlb_addr=a4d0f800
> 	start:a4d0f000
> 	end:a4d10000
> 	pool_size:4096
> 	alloc_size:4096
> 	offset:0
> 
> The complete code used for this test is available on my
> git.kernel.org[0] repository but it's bascially v6.6 + iommu/next (for
> s390 DMA API) + 2 trivial debug commits.
> 
> For further analysis I've worked closely with Halil Pasic.
> 
> The reason why we think this is happening seems twofold. Under a
> certain set of circumstances in the function swiotlb_find_slots():
> 1) the allocated transient pool can not fit the required bounce buffer,

I am aware that this can happen. It's in fact why the index returned by
swiotlb_search_pool_area() is checked in swiotlb_find_slots().

> and
> 2) the invocation of swiotlb_pool_find_slots() finds "a suitable
> slot" even though it should fail.

This needs fixing.

> The reason for 2), i.e. swiotlb_pool_find_slots() thinking there is a
> suitable bounce buffer in the pool is that for transient pools pool-
> >slots[i].list end up nonsensical, because swiotlb_init_io_tlb_pool(),  
> among other places in swiotlb, assumes that the nslabs of the pool is a
> multiple of IO_TLB_SEGSIZE

Ah... Yes, the transient pool size must also be a multiple of segment
size, but it is not enforced.

This reminds me of my debugging session that resulted in commit
aabd12609f91 ("swiotlb: always set the number of areas before
allocating the pool") and this conversation:

https://lore.kernel.org/linux-iommu/20230629074403.7f8ed9d6@meshulam.tesarici.cz/

> The reason for 1) is a bit more convoluted and not entirely understood
> by us. We are certain though that the function swiotlb_find_slots()
> allocates a pool with nr_slots(alloc_size), where this alloc_size is
> the alloc_size from swiotlb_tbl_map_single() + swiotlb_align_offset(),
> but for alignment reasons some slots may get "skipped over" in
> swiotlb_area_find_slots() causing the picked slots to overrun the
> allocation.

Yes, this can happen.

> Not sure how to properly fix this as the different alignment
> requirements get pretty complex quickly. So would appreciate your
> input.

I don't think it's possible to improve the allocation logic without
modifying the page allocator and/or the DMA atomic pool allocator to
take additional constraints into account.

I had a wild idea back in March, but it would require some intrusive
changes in the mm subsystem. Among other things, it would make memory
zones obsolete. I mean, people may actually like to get rid of DMA,
DMA32 and NORMAL, but you see how many nasty bugs were introduced even
by a relatively small change in SWIOTLB. Replacing memory zones with a
system based on generic physical allocation constraints would probably
blow up the universe. ;-)

Petr T