Date:   Wed, 4 Jul 2018 13:57:01 +0100
From:   Robin Murphy <robin.murphy@....com>
To:     benh@....ibm.com, Christoph Hellwig <hch@....de>
Cc:     Russell Currey <ruscur@....ibm.com>,
        iommu@...ts.linux-foundation.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Jens Axboe <jens.axboe@...cle.com>
Subject: Re: DMA mappings and crossing boundaries

On 02/07/18 14:37, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-02 at 14:06 +0100, Robin Murphy wrote:
> 
>   .../...
> 
> Thanks Robin, I was starting to despair that anybody would reply ;-)
> 
>>> AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
>>> HOWTO.txt) as always allocating to the next power-of-2 order, so we
>>> should never have the problem unless we allocate a single chunk larger
>>> than the IOMMU page size.
>>
>> (and even then it's not *that* much of a problem, since it comes down to
>> just finding n > 1 consecutive unused IOMMU entries for exclusive use by
>> that new chunk)
> 
> Yes, this case is not my biggest worry.
> 
>>> For dma_map_sg() however, if a request has a single "entry"
>>> spanning such a boundary, we need to ensure that the resulting
>>> mapping is 2 contiguous "large" IOMMU pages as well.
>>>
>>> However, that doesn't fit well with us re-using existing mappings since
>>> they may already exist and either not be contiguous, or partially exist
>>> with no free hole around them.
>>>
>>> Now, we *could* possibly contrive a way to solve this by detecting this
>>> case and just allocating another "pair" (or set if we cross even more
>>> pages) of IOMMU pages elsewhere, thus partially breaking our re-use
>>> scheme.
>>>
>>> But while doable, this introduces some serious complexity in the
>>> implementation, which I would very much like to avoid.
>>>
>>> So I was wondering if you guys thought that was ever likely to happen?
>>> Do you see reasonable cases where dma_map_sg() would be called with a
>>> list in which a single entry crosses a 256M or 1G boundary?
>>
>> For streaming mappings of buffers cobbled together out of any old CPU
>> pages (e.g. user memory), you may well happen to get two
>> physically-adjacent pages falling either side of an IOMMU boundary,
>> which comprise all or part of a single request - note that whilst it's
>> probably less likely than the scatterlist case, this could technically
>> happen for dma_map_{page, single}() calls too.
> 
> Could it? I wouldn't think dma_map_page() is allowed to cross page
> boundaries... what about single()? The main worry is people using
> these things on kmalloc'ed memory.

Oh, absolutely - the underlying operation is just "prepare for DMA 
to/from this physically-contiguous region"; the only real difference 
between map_page and map_single is for the sake of the usual "might be 
highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits 
on the size and offset parameters (in fact, if anyone asks I would say 
the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few 
months back is valid for dma_map_page too, however silly it may seem).
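
As a concrete illustration, here's a minimal sketch (hypothetical
function and device pointer, not taken from any real driver) of a
streaming mapping whose bus address could, in principle, straddle one
of those large boundaries; note that the API itself never checks:

	#include <linux/device.h>
	#include <linux/dma-mapping.h>
	#include <linux/sizes.h>
	#include <linux/slab.h>

	static int example_map(struct device *dev)
	{
		size_t len = 2 * PAGE_SIZE;	/* spans two CPU pages */
		void *buf = kmalloc(len, GFP_KERNEL);
		dma_addr_t handle;

		if (!buf)
			return -ENOMEM;

		handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, handle)) {
			kfree(buf);
			return -ENOMEM;
		}

		/*
		 * The boundary-crossing test under discussion: true if
		 * [handle, handle + len) straddles a 256M boundary. In
		 * practice kmalloc() hands out order-aligned chunks, so
		 * this particular buffer is unlikely to cross, but the
		 * DMA API gives no such guarantee in general.
		 */
		if ((handle ^ (handle + len - 1)) & ~(dma_addr_t)(SZ_256M - 1))
			dev_info(dev, "mapping crosses a 256M boundary\n");

		dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
		kfree(buf);
		return 0;
	}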

Of course, given that the allocators tend to give out size/order-aligned 
chunks, I think you'd have to be pretty tricksy to get two allocations 
to line up either side of a large power-of-two boundary *and* go out of 
your way to then make a single request spanning both, but it's certainly 
not illegal. Realistically, the kind of "scrape together a large buffer 
from smaller pieces" code which is liable to hit a boundary-crossing 
case by sheer chance is almost certainly going to be taking the 
sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather 
than implementing its own merging and piecemeal mapping.
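
For reference, that route looks something like the sketch below (the
page array, count and size are stand-ins for whatever the caller has
pinned or gathered; the function is purely illustrative):

	#include <linux/dma-mapping.h>
	#include <linux/scatterlist.h>

	/*
	 * Build one sg_table over an arbitrary page array, letting the
	 * core merge physically-contiguous pages into larger segments,
	 * then map the whole lot in one go.
	 */
	static int example_map_pages(struct device *dev, struct page **pages,
				     unsigned int n_pages, size_t size)
	{
		struct sg_table sgt;
		int nents, ret;

		ret = sg_alloc_table_from_pages(&sgt, pages, n_pages, 0,
						size, GFP_KERNEL);
		if (ret)
			return ret;

		nents = dma_map_sg(dev, sgt.sgl, sgt.nents,
				   DMA_BIDIRECTIONAL);
		if (!nents) {
			sg_free_table(&sgt);
			return -ENOMEM;
		}

		/* ... perform DMA on the nents mapped segments ... */

		dma_unmap_sg(dev, sgt.sgl, sgt.nents, DMA_BIDIRECTIONAL);
		sg_free_table(&sgt);
		return 0;
	}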

>> Conceptually it looks pretty easy to extend the allocation constraints
>> to cope with that - even the pathological worst case would have an
>> absolute upper bound of 3 IOMMU entries for any one physical region -
>> but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
>> DMA addresses with only four 1GB slots to play with, I can't really see a
>> way to make that practical :(
> 
> No, we are talking about 40-ish bits of address space, so there's a bit
> of leeway. Of course no scheme will work if the user app tries to map
> more than the GPU can possibly access.
> 
> But with newer AMD adding a few more bits and nVidia being at 47-bits,
> I think we have some margin, it's just that they can't reach our
> discontiguous memory with a normal 'bypass' mapping and I'd rather not
> teach Linux about every single way our HW can scatter memory across
> nodes, so an "on demand" mechanism is by far the most flexible way to
> deal with all configurations.
> 
>> Maybe the best compromise would be some sort of hybrid scheme which
>> makes sure that one of the IOMMU entries always covers the SWIOTLB
>> buffer, and invokes software bouncing for the awkward cases.
> 
> Hrm... not too sure about that. I'm happy to limit that scheme to
> well-known GPU vendor/device IDs, and SW bouncing is pointless in these
> cases. It would be nice if we could have some kind of guarantee that a
> single mapping or sglist entry never crossed a specific boundary
> though... We more or less have that for 4G already (well, we are
> supposed to at least). Who are the main potentially problematic
> subsystems here? I'm thinking network skb allocation pools... and page
> cache if it tries to coalesce entries before issuing the map request,
> does it?

I don't know of anything definite off-hand, but my hunch is to be most 
wary of anything wanting to do zero-copy access to large buffers in 
userspace pages. In particular, sg_alloc_table_from_pages() lacks any 
kind of boundary enforcement (and almost all users don't even use the
segment-length-limiting variant either), so I'd say any caller of that 
currently has a very small, but nonzero, probability of spoiling your day.
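
For completeness, the segment-length-limiting variant I mean is
__sg_alloc_table_from_pages(), which takes an extra max_segment
argument. A sketch, going by the interface as it currently stands
(max_segment must be a non-zero multiple of PAGE_SIZE; the wrapper
function here is hypothetical):

	#include <linux/scatterlist.h>

	/*
	 * Like sg_alloc_table_from_pages(), but no merged segment will
	 * exceed max_segment bytes. Note this only bounds segment
	 * *length*; it still does nothing to keep a segment from
	 * straddling a power-of-two boundary such as 256M or 1G.
	 */
	static int example_limited(struct sg_table *sgt, struct page **pages,
				   unsigned int n_pages, size_t size,
				   unsigned int max_segment)
	{
		return __sg_alloc_table_from_pages(sgt, pages, n_pages, 0,
						   size, max_segment,
						   GFP_KERNEL);
	}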

Robin.
