Message-ID: <9798b34c-618b-4e89-82b0-803bc655c82b@amd.com>
Date: Wed, 19 Nov 2025 10:18:08 +0100
From: Christian König <christian.koenig@....com>
To: Leon Romanovsky <leon@...nel.org>, Bjorn Helgaas <bhelgaas@...gle.com>,
Logan Gunthorpe <logang@...tatee.com>, Jens Axboe <axboe@...nel.dk>,
Robin Murphy <robin.murphy@....com>, Joerg Roedel <joro@...tes.org>,
Will Deacon <will@...nel.org>, Marek Szyprowski <m.szyprowski@...sung.com>,
Jason Gunthorpe <jgg@...pe.ca>, Andrew Morton <akpm@...ux-foundation.org>,
Jonathan Corbet <corbet@....net>, Sumit Semwal <sumit.semwal@...aro.org>,
Kees Cook <kees@...nel.org>, "Gustavo A. R. Silva" <gustavoars@...nel.org>,
Ankit Agrawal <ankita@...dia.com>, Yishai Hadas <yishaih@...dia.com>,
Shameer Kolothum <skolothumtho@...dia.com>, Kevin Tian
<kevin.tian@...el.com>, Alex Williamson <alex@...zbot.org>
Cc: Krishnakant Jaju <kjaju@...dia.com>, Matt Ochs <mochs@...dia.com>,
linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-block@...r.kernel.org, iommu@...ts.linux.dev, linux-mm@...ck.org,
linux-doc@...r.kernel.org, linux-media@...r.kernel.org,
dri-devel@...ts.freedesktop.org, linaro-mm-sig@...ts.linaro.org,
kvm@...r.kernel.org, linux-hardening@...r.kernel.org
Subject: Re: [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model
On 11/11/25 10:57, Leon Romanovsky wrote:
> From: Jason Gunthorpe <jgg@...dia.com>
>
> Reflect latest changes in p2p implementation to support DMABUF lifecycle.
>
> Signed-off-by: Leon Romanovsky <leonro@...dia.com>
> Signed-off-by: Jason Gunthorpe <jgg@...dia.com>
> ---
> Documentation/driver-api/pci/p2pdma.rst | 95 +++++++++++++++++++++++++--------
> 1 file changed, 72 insertions(+), 23 deletions(-)
>
> diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
> index d0b241628cf1..77e310596955 100644
> --- a/Documentation/driver-api/pci/p2pdma.rst
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -9,22 +9,47 @@ between two devices on the bus. This type of transaction is henceforth
> called Peer-to-Peer (or P2P). However, there are a number of issues that
> make P2P transactions tricky to do in a perfectly safe way.
>
> -One of the biggest issues is that PCI doesn't require forwarding
> -transactions between hierarchy domains, and in PCIe, each Root Port
> -defines a separate hierarchy domain. To make things worse, there is no
> -simple way to determine if a given Root Complex supports this or not.
> -(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
> -only supports doing P2P when the endpoints involved are all behind the
> -same PCI bridge, as such devices are all in the same PCI hierarchy
> -domain, and the spec guarantees that all transactions within the
> -hierarchy will be routable, but it does not require routing
> -between hierarchies.
> -
> -The second issue is that to make use of existing interfaces in Linux,
> -memory that is used for P2P transactions needs to be backed by struct
> -pages. However, PCI BARs are not typically cache coherent so there are
> -a few corner case gotchas with these pages so developers need to
> -be careful about what they do with them.
> +For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
> +until they reach a host bridge or root port. If the path includes PCIe switches
> +then based on the ACS settings the transaction can route entirely within
> +the PCIe hierarchy and never reach the root port. The kernel will evaluate
> +the PCIe topology and always permit P2P in these well-defined cases.
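Might be worth showing readers how a driver asks for that evaluation.
Roughly something like this (untested, provider_pdev/client_pdev are
just placeholders):

    #include <linux/pci-p2pdma.h>

    /*
     * pci_p2pdma_distance() returns a negative value when the kernel's
     * topology evaluation does not permit P2P between the two devices.
     */
    static int check_p2p_possible(struct pci_dev *provider_pdev,
                                  struct pci_dev *client_pdev)
    {
            if (pci_p2pdma_distance(provider_pdev, &client_pdev->dev,
                                    true) < 0)
                    return -ENXIO;
            return 0;
    }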
> +
> +However, if the P2P transaction reaches the host bridge then it might have to
> +hairpin back out the same root port, be routed inside the CPU SOC to another
> +PCIe root port, or routed internally to the SOC.
Please keep the reference to the PCIe specification where that behavior is defined somewhere around here, e.g. "See PCIe r4.0, sec 1.3.1".
> +
> +As this is not well-defined or well-supported in real HW the kernel defaults to
> +blocking such routing. There is an allow list to allow detecting known-good HW,
> +in which case P2P between any two PCIe devices will be permitted.
That section doesn't sound correct to me. This is well supported in current HW, it's just not defined in any official specification.
> +
> +Since P2P inherently is doing transactions between two devices it requires two
> +drivers to be co-operating inside the kernel. The providing driver has to convey
> +its MMIO to the consuming driver. To meet the driver model lifecycle rules the
> +MMIO must have all DMA mapping removed, all CPU accesses prevented, all page
> +table mappings undone before the providing driver completes remove().
> +
> +This requires the providing and consuming driver to actively work together to
> +guarantee that the consuming driver has stopped using the MMIO during a removal
> +cycle. This is done by either a synchronous invalidation shutdown or waiting
> +for all usage refcounts to reach zero.
> +
> +At the lowest level the P2P subsystem offers a naked struct p2p_provider that
> +delegates lifecycle management to the providing driver. It is expected that
> +drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
> +to provide an invalidation shutdown.
> These MMIO pages have no struct page, and
Please drop "pages" here and just say MMIO addresses.
> +if used with mmap() must create special PTEs. As such there are very few
> +kernel uAPIs that can accept pointers to them; in particular they cannot be used
> +with read()/write(), including O_DIRECT.
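A short sketch of the exporter side might help readers here, something
along these lines (untested, my_exporter_ops and struct my_dev are
placeholders for the driver's own bits):

    #include <linux/dma-buf.h>

    /* Wrap a BAR region in a DMABUF so importers can map it. */
    static struct dma_buf *export_mmio(struct my_dev *mdev, size_t size)
    {
            DEFINE_DMA_BUF_EXPORT_INFO(exp_info);

            exp_info.ops = &my_exporter_ops;
            exp_info.size = size;
            exp_info.flags = O_RDWR;
            exp_info.priv = mdev;

            return dma_buf_export(&exp_info);
    }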
> +
> +Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
> +pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
> +pgmap ensures that when the pgmap is destroyed all other drivers have stopped
> +using the MMIO. This option works with O_DIRECT flows, in some cases, if the
> +underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
> +FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
> +it also relies on architecture support along with alignment and minimum size
> +limitations.
Actually, which approach is used is up to the DMA-buf exporter. For the
P2PDMA API it should be irrelevant whether struct pages are used or
not, so I think you could drop that description here entirely.
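Just for comparison, the struct page based flow described there is the
long existing one, roughly (untested, error handling trimmed):

    #include <linux/pci-p2pdma.h>

    /* Create a pgmap over part of a BAR and allocate from it. */
    static void *setup_p2pmem(struct pci_dev *pdev, int bar, size_t size)
    {
            if (pci_p2pdma_add_resource(pdev, bar, size, 0))
                    return NULL;

            return pci_alloc_p2pmem(pdev, size);
    }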
>
>
> Driver Writer's Guide
> @@ -114,14 +139,38 @@ allocating scatter-gather lists with P2P memory.
> Struct Page Caveats
> -------------------
>
> -Driver writers should be very careful about not passing these special
> -struct pages to code that isn't prepared for it. At this time, the kernel
> -interfaces do not have any checks for ensuring this. This obviously
> -precludes passing these pages to userspace.
> +While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
> +pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.
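Maybe illustrate the opt-in, something like (untested, signature as of
recent kernels):

    #include <linux/mm.h>

    /*
     * Without FOLL_PCI_P2PDMA GUP skips such pages; FOLL_LONGTERM is
     * rejected for them.
     */
    static long pin_p2p_pages(unsigned long start, unsigned long nr,
                              struct page **pages)
    {
            return pin_user_pages(start, nr,
                                  FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
    }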
>
> -P2P memory is also technically IO memory but should never have any side
> -effects behind it. Thus, the order of loads and stores should not be important
> -and ioreadX(), iowriteX() and friends should not be necessary.
> +The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
> +KVA is still MMIO and must still be accessed through the normal
> +readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
> +like any other MMIO mapping. While this will actually work on some
> +architectures, others will experience corruption or just crash in the kernel.
> +Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
> +access happens.
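A do/don't example wouldn't hurt here, e.g. (untested):

    #include <linux/io.h>

    static void copy_from_p2p(void *dst, const void __iomem *src,
                              size_t len)
    {
            /* Correct: use the dedicated MMIO helpers. */
            memcpy_fromio(dst, src, len);

            /*
             * Wrong: a plain memcpy(dst, src, len) can use instructions
             * which fault or corrupt data on some architectures.
             */
    }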
> +
> +
> +Usage With DMABUF
> +=================
> +
> +DMABUF provides an alternative to the above struct page-based
> +client/provider/orchestrator system. In this mode the exporting driver will wrap
> +some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
> +
> +Userspace can then pass the FD to an importing driver which will ask the
> +exporting driver to map it.
"to map it to the importer".
> +
> +In this case the initiator and target pci_devices are known and the P2P subsystem
> +is used to determine the mapping type. The phys_addr_t-based DMA API is used to
> +establish the dma_addr_t.
> +
> +Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
> +to remove() it must deliver an invalidation shutdown to all DMABUF importing
> +drivers through move_notify() and synchronously DMA unmap all the MMIO.
> +
> +No importing driver can continue to have a DMA map to the MMIO after the
> +exporting driver has destroyed its p2p_provider.
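On the exporter side the remove() path then boils down to something
like (untested):

    #include <linux/dma-buf.h>
    #include <linux/dma-resv.h>

    /*
     * Tell all importers to drop their mappings before the
     * p2p_provider goes away; dma_buf_move_notify() requires the
     * reservation lock to be held.
     */
    static void invalidate_exports(struct dma_buf *dmabuf)
    {
            dma_resv_lock(dmabuf->resv, NULL);
            dma_buf_move_notify(dmabuf);
            dma_resv_unlock(dmabuf->resv);
    }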
>
>
> P2P DMA Support Library
> 

Regards,
Christian.