linux-kernel - Re: [PATCH v1 00/16] dma-mapping: migrate to physical address-based API

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250809133454.GP184255@nvidia.com>
Date: Sat, 9 Aug 2025 10:34:54 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Marek Szyprowski <m.szyprowski@...sung.com>
Cc: Leon Romanovsky <leon@...nel.org>,
	Abdiel Janulgue <abdiel.janulgue@...il.com>,
	Alexander Potapenko <glider@...gle.com>,
	Alex Gaynor <alex.gaynor@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Christoph Hellwig <hch@....de>, Danilo Krummrich <dakr@...nel.org>,
	iommu@...ts.linux.dev, Jason Wang <jasowang@...hat.com>,
	Jens Axboe <axboe@...nel.dk>, Joerg Roedel <joro@...tes.org>,
	Jonathan Corbet <corbet@....net>, Juergen Gross <jgross@...e.com>,
	kasan-dev@...glegroups.com, Keith Busch <kbusch@...nel.org>,
	linux-block@...r.kernel.org, linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	linux-nvme@...ts.infradead.org, linuxppc-dev@...ts.ozlabs.org,
	linux-trace-kernel@...r.kernel.org,
	Madhavan Srinivasan <maddy@...ux.ibm.com>,
	Masami Hiramatsu <mhiramat@...nel.org>,
	Michael Ellerman <mpe@...erman.id.au>,
	"Michael S. Tsirkin" <mst@...hat.com>,
	Miguel Ojeda <ojeda@...nel.org>,
	Robin Murphy <robin.murphy@....com>, rust-for-linux@...r.kernel.org,
	Sagi Grimberg <sagi@...mberg.me>,
	Stefano Stabellini <sstabellini@...nel.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	virtualization@...ts.linux.dev, Will Deacon <will@...nel.org>,
	xen-devel@...ts.xenproject.org
Subject: Re: [PATCH v1 00/16] dma-mapping: migrate to physical address-based
 API

On Fri, Aug 08, 2025 at 08:51:08PM +0200, Marek Szyprowski wrote:
> First - basing the API on the phys_addr_t.
> 
> Page based API had the advantage that it was really hard to abuse it and 
> call for something that is not 'a normal RAM'. 

This is not true anymore. Today we have ZONE_DEVICE as a struct page
type with a whole bunch of non-dram sub-types:

enum memory_type {
	/* 0 is reserved to catch uninitialized type fields */
	MEMORY_DEVICE_PRIVATE = 1,
	MEMORY_DEVICE_COHERENT,
	MEMORY_DEVICE_FS_DAX,
	MEMORY_DEVICE_GENERIC,
	MEMORY_DEVICE_PCI_P2PDMA,
};

Few of which are kmappable/page_to_virtable() in a way that is useful
for the DMA API.

DMA API sort of ignores all of this and relies on the caller to not
pass in an incorrect struct page. eg we rely on things like the block
stack to do the right stuff when a MEMORY_DEVICE_PCI_P2PDMA is present
in a bio_vec.

Which is not really fundamentally different from just using
phys_addr_t in the first place.

Sure, this was a stronger argument when this stuff was originally
written, before ZONE_DEVICE was invented.

> I initially though that phys_addr_t based API will somehow simplify
> arch specific implementation, as some of them indeed rely on
> phys_addr_t internally, but I missed other things pointed by
> Robin. Do we have here any alternative?

I think it is less of a code simplification, more as a reduction in
conceptual load. When we can say directly there is no struct page type
anyhwere in the DMA API layers then we only have to reason about
kmap/phys_to_virt compatibly.

This is also a weaker overall requirement than needing an actual
struct page which allows optimizing other parts of the kernel. Like we
aren't forced to create MEMORY_DEVICE_PCI_P2PDMA stuct pages just to
use the dma api.

Again, any place in the kernel we can get rid of struct page the
smoother the road will be for the MM side struct page restructuring.

For example one of the bigger eventual goes here is to make a bio_vec
store phys_addr_t, not struct page pointers.

DMA API is not alone here, we have been de-struct-paging the kernel
for a long time now:

netdev: https://lore.kernel.org/linux-mm/20250609043225.77229-1-byungchul@sk.com/
slab: https://lore.kernel.org/linux-mm/20211201181510.18784-1-vbabka@suse.cz/
iommmu: https://lore.kernel.org/all/0-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com/
page tables: https://lore.kernel.org/linux-mm/20230731170332.69404-1-vishal.moola@gmail.com/
zswap: https://lore.kernel.org/all/20241216150450.1228021-1-42.hyeyoo@gmail.com/

With a long term goal that struct page only exists for legacy code,
and is maybe entirely compiled out of modern server kernels.

> Second - making dma_map_phys() a single API to handle all cases.
> 
> Do we really need such single function to handle all cases? 

If we accept the direction to remove struct page then it makes little
sense to have a dma_map_ram(phys_addr) and dma_map_resource(phys_addr)
and force key callers (like block) to have more ifs - especially if
the conditional could become "free" inside the dma API (see below).

Plus if we keep the callchain split then adding a
"dma_link_resource"/etc are now needed as well.

> DMA_ATTR_MMIO for every typical DMA user? I know that branching is 
> cheap, but this will probably increase code size for most of the typical 
> users for no reason.

Well, having two call chains will increase the code size much more,
and 'resource' can't be compiled out. Arguably this unification should
reduce the .text size since many of the resource only functions go
away.

There are some branches, and I think the push toward re-using
DMA_ATTR_SKIP_CPU_SYNC was directly to try to reduce that branch
cost.

However, I think we should be looking for a design here that is "free"
on the fast no-swiotlb and non-cache-flush path. I think this can be
achieved by checking ATTR_MMIO only after seeing swiotlb is needed
(like today's is p2p check). And we can probably freely fold it into
the existing sync check:

	if ((attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)) == 0)

I saw Leon hasn't done these micro optimizations, but it seems like it
could work out.

Regards,
Jason