Message-ID: <20240307150505.GA28978@lst.de>
Date: Thu, 7 Mar 2024 16:05:05 +0100
From: Christoph Hellwig <hch@....de>
To: Jason Gunthorpe <jgg@...pe.ca>
Cc: Christoph Hellwig <hch@....de>, Leon Romanovsky <leon@...nel.org>,
Robin Murphy <robin.murphy@....com>,
Marek Szyprowski <m.szyprowski@...sung.com>,
Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
Chaitanya Kulkarni <chaitanyak@...dia.com>,
Jonathan Corbet <corbet@....net>, Jens Axboe <axboe@...nel.dk>,
Keith Busch <kbusch@...nel.org>, Sagi Grimberg <sagi@...mberg.me>,
Yishai Hadas <yishaih@...dia.com>,
Shameer Kolothum <shameerali.kolothum.thodi@...wei.com>,
Kevin Tian <kevin.tian@...el.com>,
Alex Williamson <alex.williamson@...hat.com>,
Jérôme Glisse <jglisse@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-block@...r.kernel.org, linux-rdma@...r.kernel.org,
iommu@...ts.linux.dev, linux-nvme@...ts.infradead.org,
kvm@...r.kernel.org, linux-mm@...ck.org,
Bart Van Assche <bvanassche@....org>,
Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
Amir Goldstein <amir73il@...il.com>,
"josef@...icpanda.com" <josef@...icpanda.com>,
"Martin K. Petersen" <martin.petersen@...cle.com>,
"daniel@...earbox.net" <daniel@...earbox.net>,
Dan Williams <dan.j.williams@...el.com>,
"jack@...e.com" <jack@...e.com>, Zhu Yanjun <zyjzyj2000@...il.com>
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two
steps
On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> >
> > I don't think you can do without dma_addr_t storage. In most cases
> > you can just store the dma_addr_t in the LE/BE encoded hardware
> > SGL, so no extra storage should be needed though.
>
> RDMA (and often DRM too) generally doesn't work like that: the driver
> copies the page table into the device, and then the only reason to
> keep dma_addr_t storage around is to pass it to the dma unmap API.
> Optionally eliminating long term dma_addr_t storage would be a
> worthwhile memory savings for large, long lived user space memory
> registrations.
It's just kinda hard to do. For aligned IOMMU mapping you'd only
have one dma_addr_t mapping (or maybe a few if P2P regions are
involved), so this probably doesn't matter. For direct mappings
you'd have a few, but maybe the better answer is to use THP
more aggressively and reduce the number of segments.
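
To make that concrete, here is a minimal sketch of storing the
dma_addr_t directly in a little-endian hardware SGL entry (the
hw_sgl_entry layout and the hw_sgl_set helper are made up for
illustration, real device formats differ):

    /* hypothetical LE-encoded hardware SGL entry */
    struct hw_sgl_entry {
            __le64 addr;
            __le32 len;
            __le32 flags;
    };

    static void hw_sgl_set(struct hw_sgl_entry *e, dma_addr_t addr, u32 len)
    {
            e->addr = cpu_to_le64(addr);
            e->len = cpu_to_le32(len);
            e->flags = 0;
    }

    /* at unmap time the address is read back out of the SGL, so no
     * separate dma_addr_t array has to be kept around: */
    dma_unmap_page(dev, le64_to_cpu(e->addr), le32_to_cpu(e->len),
                   DMA_TO_DEVICE);
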
> I wrote the list from a single IO operation's perspective, so all but
> 5 need to store only a single IOVA range, which could live in some
> simple non-dynamic memory along with whatever HW SGLs/etc are needed.
>
> The point of 5 being different is that the driver has to provide a
> dynamically sized list of dma_addr_t's as storage until unmap. 5 is
> the only case that requires that full list.
No, all cases need to store one or more ranges.
> > > So are you thinking something more like a driver flow of:
> > >
> > >    .. extent IO and get # aligned pages and know if there is P2P ..
> > >    dma_init_io(state, num_pages, p2p_flag)
> > >    if (dma_io_single_range(state)) {
> > >            // #2, #4
> > >            for each io()
> > >                    dma_link_aligned_pages(state, io range)
> > >            hw_sgl = (state->iova, state->len)
> > >    } else {
> >
> > I think what you have as dma_io_single_range should happen before
> > the dma_init_io. If we know we can't coalesce, it really is just a
> > dma_map_{single,page,bvec} loop, no need for any extra state.
>
> I imagine dma_io_single_range() would just check a flag in state.
>
> I still want to call dma_init_io() for the non-coalescing cases
> because all the flows, regardless of composition, should be about as
> fast as dma_map_sg is today.
If all flows include multiple non-coalesced regions, that just makes
things very complicated, and that's exactly what I'd want to avoid.
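
To spell out the plain loop I mean for the non-coalesced case (a
sketch over today's API, reusing the hw_sgl_set helper from above;
error unwinding abbreviated):

    for (i = 0; i < nr_pages; i++) {
            dma_addr_t addr = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
                                           DMA_TO_DEVICE);

            if (dma_mapping_error(dev, addr))
                    goto unwind;    /* dma_unmap_page() what we mapped */
            hw_sgl_set(&hw_sgl[i], addr, PAGE_SIZE);
    }
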
> That means we need to always pre-allocate the IOVA in any case where
> the IOMMU might be active - even on a non-coalescing flow.
>
> IOW, dma_init_io() always pre-allocates the IOVA if the iommu is going
> to be used, so that we don't have to call today's dma_map_page() in a
> loop on the non-coalescing side and pay the overhead of Nx IOVA
> allocations.
>
> In large part this is for RDMA, where a single P2P page in a large
> multi-gigabyte user memory registration shouldn't drastically harm the
> registration performance by falling back to doing dma_map_page, and an
> IOVA allocation, on a 4k page-by-page basis.
But that P2P page needs to be handled very differently, as with it
we can't actually use a single iova range. So I'm not sure how that
is even supposed to work. If you have
+-------+-----+-------+
| local | P2P | local |
+-------+-----+-------+
you need at least 3 hw SGL entries, as the IOVA won't be contiguous.
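
Even if each piece maps fine, the device then sees three discontiguous
DMA address ranges, roughly (variable names made up):

    hw_sgl_set(&hw_sgl[0], local_iova_0, local_len_0);  /* local */
    hw_sgl_set(&hw_sgl[1], p2p_addr, p2p_len);          /* P2P   */
    hw_sgl_set(&hw_sgl[2], local_iova_1, local_len_1);  /* local */
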
> The other thing that got hand-waved here is how dma_init_io() knows
> which of the 6 states we are looking at. I imagine we probably
> want to do something like:
>
>     struct dma_io_summarize summary = {};
>     for each io()
>             dma_io_summarize_range(&summary, io range)
>     dma_init_io(dev, &state, &summary);
>     if (state->single_range) {
>     } else {
>     }
>     dma_io_done_mapping(&state); <-- flush IOTLB once
That's why I really just want 2 cases. If the caller guarantees the
range is coalescable and there is an IOMMU, use the iommu-API-like
API; else just iterate over map_single/page.
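
In rough pseudo-C, with entirely made-up names for the coalesced side
(none of dma_iova_alloc/dma_iova_link/dma_iova_sync exist today, this
is only to show the shape of the two cases):

    if (coalescable && device_uses_iommu(dev)) {
            /* one IOVA allocation, link every range into it */
            iova = dma_iova_alloc(dev, total_len);
            for_each_range(r)
                    dma_iova_link(dev, iova, r);
            dma_iova_sync(dev);             /* flush IOTLB once */
            hw_sgl_set(&hw_sgl[0], iova, total_len);
    } else {
            /* plain dma_map_page() loop as sketched above */
    }
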
> Enhancing the single sgl case is not a big change, I think. It does
> seem simpler for the driver to not have to coalesce SGLs to detect
> the single-SGL fast-path.
>
> > > This is not quite what you said, we split the driver flow based on
> > > needing 1 HW SGL vs need many HW SGL.
> >
> > That's at least what I intended to say, and I'm a little curious as
> > to how it came across.
>
> Ok, I was reading the discussion as being more about alignment than a
> single HW SGL; I think you meant alignment as implying coalescing
> behavior, implying a single HW SGL.
Yes.