linux-kernel - Re: [PATCH v1 00/17] Provide a new two step DMA mapping API

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241104095831.GA28751@lst.de>
Date: Mon, 4 Nov 2024 10:58:31 +0100
From: Christoph Hellwig <hch@....de>
To: Robin Murphy <robin.murphy@....com>
Cc: Leon Romanovsky <leon@...nel.org>, Jens Axboe <axboe@...nel.dk>,
	Jason Gunthorpe <jgg@...pe.ca>, Joerg Roedel <joro@...tes.org>,
	Will Deacon <will@...nel.org>, Christoph Hellwig <hch@....de>,
	Sagi Grimberg <sagi@...mberg.me>, Keith Busch <kbusch@...nel.org>,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Logan Gunthorpe <logang@...tatee.com>,
	Yishai Hadas <yishaih@...dia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@...wei.com>,
	Kevin Tian <kevin.tian@...el.com>,
	Alex Williamson <alex.williamson@...hat.com>,
	Marek Szyprowski <m.szyprowski@...sung.com>,
	Jérôme Glisse <jglisse@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Jonathan Corbet <corbet@....net>, linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-block@...r.kernel.org,
	linux-rdma@...r.kernel.org, iommu@...ts.linux.dev,
	linux-nvme@...ts.infradead.org, linux-pci@...r.kernel.org,
	kvm@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v1 00/17] Provide a new two step DMA mapping API

On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> The hilarious amount of work that iommu_dma_map_sg() does is pretty much 
> entirely for the benefit of v4l2 and dma-buf importers who *depend* on 
> being able to linearise a scatterlist in DMA address space. TBH I doubt 
> there are many actual scatter-gather-capable devices with significant 
> enough limitations to meaningfully benefit from DMA segment combining these 
> days - I've often thought that by now it might be a good idea to turn that 
> behaviour off by default and add an attribute for callers to explicitly 
> request it.

Even when devices are not limited they often perform significantly better
when IOVA space is not completely fragmented.  While the dma_map_sg code
is a bit gross due to the fact that it has to deal with unaligned segments,
the coalescing itself often is a big win.

Note that dma_map_sg also has two other very useful features:  batching
of the iotlb flushing, and support for P2P, which to be efficient also
requires batching the lookups.

>> This uniqueness has been a long standing pain point as the scatterlist API
>> is mandatory, but expensive to use.
>
> Huh? When and where has anything ever called it mandatory? Nobody's getting 
> sent to DMA jail for open-coding:

You don't get sent to jail.  But you do not get batched iotlb sync, you
don't get properly working P2P, and you don't get IOVA coalescing.

>> Several approaches have been explored to expand the DMA API with additional
>> scatterlist-like structures (BIO, rlist), instead split up the DMA API
>> to allow callers to bring their own data structure.
>
> And this line of reasoning is still "2 + 2 = Thursday" - what is to say 
> those two notions in any way related? We literally already have one generic 
> DMA operation which doesn't operate on struct page, yet needed nothing 
> "split up" to be possible.

Yeah, I don't really get the struct page argument.  In fact if we look
at the nitty-gritty details of dma_map_page it doesn't really need a
page at all.  I've been looking at cleaning some of this up and providing
a dma_map_phys/paddr which would be quite handy in a few places.  Note
because we don't have a struct page for the memory, but because converting
to/from it all the time is not very efficient.

>>   2. VFIO PCI live migration code is building a very large "page list"
>>      for the device. Instead of allocating a scatter list entry per allocated
>>      page it can just allocate an array of 'struct page *', saving a large
>>      amount of memory.
>
> VFIO already assumes a coherent device with (realistically) an IOMMU which 
> it explicitly manages - why is it even pretending to need a generic DMA 
> API?

AFAIK that does isn't really vfio as we know it but the control device
for live migration.  But Leon or Jason might fill in more.

The point is that quite a few devices have these page list based APIs
(RDMA where mlx5 comes from, NVMe with PRPs, AHCI, GPUs).

>
>>   3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
>>      list without having to allocate then populate an intermediate SG table.
>
> As above, given that a bio_vec still deals in struct pages, that could 
> seemingly already be done by just mapping the pages, so how is it proving 
> any benefit of a fragile new interface?

Because we only need to preallocate the tiny constant sized dma_iova_state
as part of the request instead of an additional scatterlist that requires
sizeof(struct page *) + sizeof(dma_addr_t) + 3 * sizeof(unsigned int)
per segment, including a memory allocation per I/O for that.

> My big concern here is that a thin and vaguely-defined wrapper around the 
> IOMMU API is itself a step which smells strongly of "abuse and design 
> mistake", given that the basic notion of allocating DMA addresses in 
> advance clearly cannot generalise. Thus it really demands some considered 
> justification beyond "We must do something; This is something; Therefore we 
> must do this." to be convincing.

At least for the block code we have a nice little core wrapper that is
very easy to use, and provides a great reduction of memory use and
allocations.  The HMM use case I'll let others talk about.