linux-kernel - Memory corruption with CONFIG_SWIOTLB

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <104a8c8fedffd1ff8a2890983e2ec1c26bff6810.camel@linux.ibm.com>
Date:   Fri, 03 Nov 2023 16:13:03 +0100
From:   Niklas Schnelle <schnelle@...ux.ibm.com>
To:     Christoph Hellwig <hch@....de>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        Marek Szyprowski <m.szyprowski@...sung.com>,
        Robin Murphy <robin.murphy@....com>,
        Petr Tesarik <petr.tesarik1@...wei-partners.com>,
        Ross Lagerwall <ross.lagerwall@...rix.com>
Cc:     linux-pci <linux-pci@...r.kernel.org>,
        linux-kernel@...r.kernel.org, iommu@...ts.linux.dev,
        Matthew Rosato <mjrosato@...ux.ibm.com>,
        Halil Pasic <pasic@...ux.ibm.com>
Subject: Memory corruption with CONFIG_SWIOTLB_DYNAMIC=y

Hi Swiotlb Maintainers,

With s390 slated to use dma-iommu.c I was experimenting with
CONFIG_SWIOTLB_DYNAMIC but was getting errors all over the place.
Debugging this I've managed to narrow things down to what I believe is
a memory corruption issue caused by overrunning the entire transient
memory pool allocated by swiotlb. In my test this happens on the
iommu_dma_map_page(), swiotlb_tbl_map_single() path when handling
untrusted PCI devices.

I've seen this happen only for transient pools when:
*  allocation size >=PAGE_SIZE and,
*  the original address of the mapping is not page aligned. 

The overrun can be seen clearly by adding a simple "tlb_addr +
alloc_size > pool->end' overrun check to swiotlb_tbl_map_single() and
forcing PCI test devices to be untrusted. With that in place while an
NVMe fio work load runs fine TCP/IP tests on a Mellanox ConnectX-4 VF
reliably trigger the overrun check and with a debug print produce
output like the following:

software IO TLB:
	transient:1
	index:1
	dma_get_min_align_mask(dev):0
	orig_addr:ac0caebe
	tlb_addr=a4d0f800
	start:a4d0f000
	end:a4d10000
	pool_size:4096
	alloc_size:4096
	offset:0

The complete code used for this test is available on my
git.kernel.org[0] repository but it's bascially v6.6 + iommu/next (for
s390 DMA API) + 2 trivial debug commits.

For further analysis I've worked closely with Halil Pasic.

The reason why we think this is happening seems twofold. Under a
certain set of circumstances in the function swiotlb_find_slots():
1) the allocated transient pool can not fit the required bounce buffer,
and
2) the invocation of swiotlb_pool_find_slots() finds "a suitable
slot" even though it should fail.

The reason for 2), i.e. swiotlb_pool_find_slots() thinking there is a
suitable bounce buffer in the pool is that for transient pools pool-
>slots[i].list end up nonsensical, because swiotlb_init_io_tlb_pool(),
among other places in swiotlb, assumes that the nslabs of the pool is a
multiple of IO_TLB_SEGSIZE

The reason for 1) is a bit more convoluted and not entirely understood
by us. We are certain though that the function swiotlb_find_slots()
allocates a pool with nr_slots(alloc_size), where this alloc_size is
the alloc_size from swiotlb_tbl_map_single() + swiotlb_align_offset(),
but for alignment reasons some slots may get "skipped over" in
swiotlb_area_find_slots() causing the picked slots to overrun the
allocation.

Not sure how to properly fix this as the different alignment
requirements get pretty complex quickly. So would appreciate your
input.

Thanks,
Niklas

[0] bounce-min branch of my git.kernel.org repository at
    https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git