Date:   Fri, 20 Jul 2018 13:12:56 +1000
From:   Benjamin Herrenschmidt <benh@...nel.crashing.org>
To:     Robin Murphy <robin.murphy@....com>, Christoph Hellwig <hch@....de>
Cc:     Russell Currey <ruscur@....ibm.com>,
        iommu@...ts.linux-foundation.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Jens Axboe <jens.axboe@...cle.com>
Subject: Re: DMA mappings and crossing boundaries

On Wed, 2018-07-11 at 14:51 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-07-04 at 13:57 +0100, Robin Murphy wrote:
> > 
> > > Could it ? I wouldn't think dma_map_page is allowed to cross page
> > > boundaries ... what about single() ? The main worry is people using
> > > these things on kmalloc'ed memory
> > 
> > Oh, absolutely - the underlying operation is just "prepare for DMA 
> > to/from this physically-contiguous region"; the only real difference 
> > between map_page and map_single is for the sake of the usual "might be 
> > highmem" vs. "definitely lowmem" dichotomy. Nobody's policing any limits 
> > on the size and offset parameters (in fact, if anyone asks I would say 
> > the outcome of the big "offset > PAGE_SIZE" debate for dma_map_sg a few 
> > months back is valid for dma_map_page too, however silly it may seem).
> 
> I think this is a very bad idea though. We should strive to avoid
> coalescing before the iommu layer and we should avoid creating sglists
> that cross natural alignment boundaries.

Ping ? Jens, Christoph ?

I really really don't like how sg_alloc_table_from_pages() coalesces
the sglist before it gets mapped. Passing such large merged chunks down
to the IOMMU allocator will potentially impose huge alignment
constraints on it and fragment its address space.
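
To make it concrete, here is the sort of thing I'd rather see callers
do (untested sketch, assuming the __sg_alloc_table_from_pages() variant
that takes a max_segment argument; the cap value is arbitrary):

#include <linux/scatterlist.h>
#include <linux/sizes.h>

/*
 * Sketch only: build the sg table but cap how far physically contiguous
 * pages are merged into a single entry, instead of coalescing without
 * bound the way sg_alloc_table_from_pages() does.
 */
static int build_sgt_capped(struct sg_table *sgt, struct page **pages,
			    unsigned int n_pages, unsigned long size)
{
	return __sg_alloc_table_from_pages(sgt, pages, n_pages, 0, size,
					   SZ_64K /* arbitrary cap */,
					   GFP_KERNEL);
}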

> I had a look at sg_alloc_table_from_pages() and it basically freaks me
> out.
> 
> Back in the old days, we used to have the block layer aggressively
> coalesce requests iirc, but that was changed to let the iommu layer do
> it.
> 
> If you pass to dma_map_sg() something already heavily coalesced, such
> as what sg_alloc_table_from_pages() can do since it seems to have
> absolutely no limits there, you will create a significant fragmentation
> problem in the iommu allocator.
> 
> Instead, we should probably avoid such coalescing at that level and
> instead let the iommu opportunistically coalesce at the point of
> mapping which it does just fine.
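
(To illustrate the alternative: something along these lines, untested,
with one sg entry per page and no merging before dma_map_sg(); the
helper name is made up for the example.)

#include <linux/scatterlist.h>

/* Sketch: one entry per page; let dma_map_sg() / the IOMMU merge them
 * at mapping time if it can. */
static int build_sgt_per_page(struct sg_table *sgt, struct page **pages,
			      unsigned int n_pages)
{
	struct scatterlist *sg;
	unsigned int i;
	int ret;

	ret = sg_alloc_table(sgt, n_pages, GFP_KERNEL);
	if (ret)
		return ret;

	for_each_sg(sgt->sgl, sg, n_pages, i)
		sg_set_page(sg, pages[i], PAGE_SIZE, 0);

	return 0;
}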
> 
> We've been racking our brains here and we can't find a way to implement
> something we want that is both very performance efficient (no global
> locks etc...) and works with boundary crossing mappings.
> 
> We could always fallback to classic small page mappings but that is
> quite suboptimal.
> 
> I also notice that sg_alloc_table_from_pages() doesn't try to prevent
> crossing the 4G boundary. I remember pretty clearly that it was
> explicitly forbidden to create such a crossing.
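
(The invariant I have in mind is simply that no single mapping should
ever satisfy a check like the one below; purely illustrative, not an
existing helper.)

#include <linux/types.h>

/* Does [addr, addr + len) cross a 4 GiB boundary? Assumes len > 0. */
static bool crosses_4g(dma_addr_t addr, size_t len)
{
	return (addr >> 32) != ((addr + len - 1) >> 32);
}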
> 
> > Of course, given that the allocators tend to give out size/order-aligned 
> > chunks, I think you'd have to be pretty tricksy to get two allocations 
> > to line up either side of a large power-of-two boundary *and* go out of 
> > your way to then make a single request spanning both, but it's certainly 
> > not illegal. Realistically, the kind of "scrape together a large buffer 
> > from smaller pieces" code which is liable to hit a boundary-crossing 
> > case by sheer chance is almost certainly going to be taking the 
> > sg_alloc_table_from_pages() + dma_map_sg() route for convenience, rather 
> > than implementing its own merging and piecemeal mapping.
> 
> Yes, and I think what sg_alloc_table_from_pages() does is quite wrong.
> 
> > > > Conceptually it looks pretty easy to extend the allocation constraints
> > > > to cope with that - even the pathological worst case would have an
> > > > absolute upper bound of 3 IOMMU entries for any one physical region -
> > > > but if in practice it's a case of mapping arbitrary CPU pages to 32-bit
> > > > DMA addresses having only 4 1GB slots to play with, I can't really see a
> > > > way to make that practical :(
> > > 
> > > No we are talking about 40-ish-bits of address space, so there's a bit
> > > of leeway. Of course no scheme will work if the user app tries to map
> > > more than the GPU can possibly access.
> > > 
> > > But with newer AMD adding a few more bits and nVidia being at 47-bits,
> > > I think we have some margin, it's just that they can't reach our
> > > discontiguous memory with a normal 'bypass' mapping and I'd rather not
> > > teach Linux about every single way our HW can scatter memory across
> > > nodes, so an "on demand" mechanism is by far the most flexible way to
> > > deal with all configurations.
> > > 
> > > > Maybe the best compromise would be some sort of hybrid scheme which
> > > > makes sure that one of the IOMMU entries always covers the SWIOTLB
> > > > buffer, and invokes software bouncing for the awkward cases.
> > > 
> > > Hrm... not too sure about that. I'm happy to limit that scheme to well
> > > known GPU vendor/device IDs, and SW bouncing is pointless in these
> > > cases. It would be nice if we could have some kind of guarantee that a
> > > single mapping or sglist entry never crossed a specific boundary
> > > though... We more/less have that for 4G already (well, we are supposed
> > > to at least). Who are the main potential problematic subsystems here ?
> > > I'm thinking network skb allocation pools ... and page cache if it
> > > tries to coalesce entries before issuing the map request, does it ?
> > 
> > I don't know of anything definite off-hand, but my hunch is to be most 
> > wary of anything wanting to do zero-copy access to large buffers in 
> > userspace pages. In particular, sg_alloc_table_from_pages() lacks any 
> > kind of boundary enforcement (and almost all users don't even use the 
> > segment-length-limiting variant either), so I'd say any caller of that 
> > currently has a very small, but nonzero, probability of spoiling your day.
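
FWIW the constraint itself can already be described per-device via the
DMA parameters, it's just that sg_alloc_table_from_pages() doesn't take
a device and so can't look at them. Untested illustration (assumes the
bus has allocated dev->dma_parms):

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

/* Sketch: advertise segment limits so that code which does consult
 * dma_parms (block layer limits, dma_map_sg backends) avoids building
 * boundary-crossing segments. */
static int foo_set_dma_limits(struct device *dev)
{
	int ret;

	ret = dma_set_max_seg_size(dev, SZ_64K);
	if (ret)
		return ret;

	/* no segment may cross a 4 GiB boundary */
	return dma_set_seg_boundary(dev, DMA_BIT_MASK(32));
}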
> 
> And I'm starting to think we should just fix them.
> 
> Cheers,
> Ben.
> 
> > Robin.
> > 
