Message-ID: <IA0PR11MB71855A080F43E4D657A70311F859A@IA0PR11MB7185.namprd11.prod.outlook.com>
Date: Fri, 25 Jul 2025 05:34:40 +0000
From: "Kasireddy, Vivek" <vivek.kasireddy@...el.com>
To: Leon Romanovsky <leon@...nel.org>
CC: Alex Williamson <alex.williamson@...hat.com>, Christoph Hellwig
<hch@....de>, Jason Gunthorpe <jgg@...dia.com>, Andrew Morton
<akpm@...ux-foundation.org>, Bjorn Helgaas <bhelgaas@...gle.com>,
Christian König <christian.koenig@....com>,
"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>, Jens Axboe
<axboe@...nel.dk>, Jérôme Glisse <jglisse@...hat.com>,
Joerg Roedel <joro@...tes.org>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linaro-mm-sig@...ts.linaro.org" <linaro-mm-sig@...ts.linaro.org>,
"linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-media@...r.kernel.org" <linux-media@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "linux-pci@...r.kernel.org"
<linux-pci@...r.kernel.org>, Logan Gunthorpe <logang@...tatee.com>, "Marek
Szyprowski" <m.szyprowski@...sung.com>, Robin Murphy <robin.murphy@....com>,
Sumit Semwal <sumit.semwal@...aro.org>, Will Deacon <will@...nel.org>
Subject: RE: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
regions
Hi Leon,
> Subject: Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
> regions
>
> > >
> > > From: Leon Romanovsky <leonro@...dia.com>
> > >
> > > Add support for exporting PCI device MMIO regions through dma-buf,
> > > enabling safe sharing of non-struct page memory with controlled
> > > lifetime management. This allows RDMA and other subsystems to import
> > > dma-buf FDs and build them into memory regions for PCI P2P operations.
> > >
> > > The implementation provides a revocable attachment mechanism using
> > > dma-buf move operations. MMIO regions are normally pinned as BARs
> > > don't change physical addresses, but access is revoked when the VFIO
> > > device is closed or a PCI reset is issued. This ensures kernel
> > > self-defense against potentially hostile userspace.
> > >
> > > Signed-off-by: Jason Gunthorpe <jgg@...dia.com>
> > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@...el.com>
> > > Signed-off-by: Leon Romanovsky <leonro@...dia.com>
> > > ---
> > > drivers/vfio/pci/Kconfig | 20 ++
> > > drivers/vfio/pci/Makefile | 2 +
> > > drivers/vfio/pci/vfio_pci_config.c | 22 +-
> > > drivers/vfio/pci/vfio_pci_core.c | 25 ++-
> > > drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
> > > drivers/vfio/pci/vfio_pci_priv.h | 23 +++
> > > include/linux/dma-buf.h | 1 +
> > > include/linux/vfio_pci_core.h | 3 +
> > > include/uapi/linux/vfio.h | 19 ++
> > > 9 files changed, 431 insertions(+), 5 deletions(-)
> > > create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
>
> <...>
>
> > > +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
> > > + struct vfio_device_feature_dma_buf *dma_buf)
> > > +{
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + u32 bar = dma_buf->region_index;
> > > + u64 offset = dma_buf->offset;
> > > + u64 len = dma_buf->length;
> > > + resource_size_t bar_size;
> > > + u64 sum;
> > > +
> > > + /*
> > > + * For PCI the region_index is the BAR number like everything else.
> > > + */
> > > + if (bar >= VFIO_PCI_ROM_REGION_INDEX)
> > > + return -ENODEV;
>
> <...>
>
> > > +/**
> > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > > + * regions selected.
> > > + *
> > > + * open_flags are the typical flags passed to open(2), eg O_RDWR,
> > > + * O_CLOEXEC, etc. offset/length specify a slice of the region to create
> > > + * the dmabuf from. nr_ranges is the total number of (P2P DMA) ranges
> > > + * that comprise the dmabuf.
> > Any particular reason why you dropped the option (nr_ranges) of creating a
> > single dmabuf from multiple ranges of an MMIO region?
>
> I did it for two reasons. First, I wanted to simplify the code in order
> to speed-up discussion over the patchset itself. Second, I failed to
> find justification for need of multiple ranges, as the number of BARs
> are limited by VFIO_PCI_ROM_REGION_INDEX (6) and same functionality
> can be achieved by multiple calls to DMABUF import.
I don't think the same functionality can be achieved by multiple calls to
dmabuf import. AFAIU, a dmabuf (as of today) is backed by an SGL that can
have multiple entries because it represents a scattered buffer (multiple
non-contiguous chunks in System RAM or an MMIO region). But this patch
constrains it such that only a single entry per buffer can be included,
which effectively means that we can no longer create a dmabuf to represent
scattered buffers (located in a single MMIO region such as VRAM or other
device memory).
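To make this concrete, below is a rough sketch of what a multi-range variant
of the UAPI could look like (struct/field names here are only illustrative,
not necessarily the layout used in the earlier revision): the dmabuf would be
built from nr_ranges slices of the same region, each described by an
offset/length pair, and the exporter would populate one scatterlist entry per
range.

    /* Illustrative sketch only; names and layout are not from this patch. */
    struct vfio_region_dma_range {
            __u64 offset;   /* byte offset into the selected region/BAR */
            __u64 length;   /* length of this slice */
    };

    struct vfio_device_feature_dma_buf {
            __u32 region_index;
            __u32 open_flags;
            __u32 flags;
            __u32 nr_ranges;
            struct vfio_region_dma_range dma_ranges[];
    };

With nr_ranges fixed at 1, none of the scattered cases described above can be
expressed.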
>
> >
> > Restricting the dmabuf to a single range (or having to create multiple
> > dmabufs to represent multiple regions/ranges associated with a single
> > scattered buffer) would be very limiting and may not work in all cases.
> > For instance, in my use-case, I am trying to share a large (4k mode)
> > framebuffer (FB) located in GPU's VRAM between two (p2p compatible) GPU
> > devices. And, this would probably not work given that allocating a large
> > contiguous FB (nr_ranges = 1) in VRAM may not be feasible when there is
> > memory pressure.
>
> Can you please help me and point to the place in the code where this can
> fail?
> I'm probably missing something basic as there are no large allocations
> in the current patchset.
Sorry, I was not very clear. What I meant is that it is not prudent to assume
that a buffer we need to export will be backed by only one range of an MMIO
region when creating a dmabuf. I was pointing to my use-case as an example
where vfio-pci needs to create a dmabuf for a large buffer (FB) that would
likely be scattered (not contiguous) in an MMIO region (such as VRAM).
Let me further explain with my use-case. Here is a link to my Qemu-based test:
https://gitlab.freedesktop.org/Vivek/qemu/-/commit/b2bdb16d9cfaf55384c95b1f060f175ad1c89e95#81dc845f0babf39649c4e086e173375614111b4a_29_46
While exhaustively testing this case, I noticed that the Guest VM (GPU driver)
would occasionally create the buffer (represented by virtio_gpu_simple_resource,
for which we need to create a dmabuf) in such a way that multiple entries
(indicated by res->iov_cnt) need to be included. This is the main reason why I
added support for nr_ranges > 1 to this patch/feature.
Furthermore, creating a separate dmabuf for each range of the same buffer, as
you suggest (IIUC), is suboptimal and does not align with how dmabuf works
today: one dmabuf represents one buffer, however scattered its backing storage
may be.
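For illustration, assuming a multi-range UAPI along the lines of the sketch
earlier in this mail, the userspace side (QEMU in my test) could hand over all
of the resource's chunks in a single call, roughly as in the fragment below.
res and device_fd come from the surrounding QEMU code, res_chunk_offset() is a
hypothetical helper that translates each chunk to its offset within the VRAM
BAR, and error handling is omitted:

    size_t argsz = sizeof(struct vfio_device_feature) +
                   sizeof(struct vfio_device_feature_dma_buf) +
                   res->iov_cnt * sizeof(struct vfio_region_dma_range);
    struct vfio_device_feature *feature = calloc(1, argsz);
    struct vfio_device_feature_dma_buf *dma_buf = (void *)feature->data;

    feature->argsz = argsz;
    feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;
    dma_buf->region_index = VFIO_PCI_BAR2_REGION_INDEX;  /* e.g. VRAM BAR */
    dma_buf->open_flags = O_CLOEXEC | O_RDWR;
    dma_buf->nr_ranges = res->iov_cnt;
    for (unsigned int i = 0; i < res->iov_cnt; i++) {
            dma_buf->dma_ranges[i].offset = res_chunk_offset(res, i);
            dma_buf->dma_ranges[i].length = res->iov[i].iov_len;
    }
    int dmabuf_fd = ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);

With a single-range UAPI, the same resource needs res->iov_cnt separate
dmabufs, and the importer has to stitch them back together itself.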
>
> >
> > Furthermore, since you are adding a new UAPI with this patch/feature, as
> > you know, we cannot go back and tweak it (to add support for nr_ranges > 1)
> > should there be a need in the future, but you can always use nr_ranges = 1
> > anytime. Therefore, I think it makes sense to be flexible in terms of the
> > number of ranges to include while creating a dmabuf instead of restricting
> > ourselves to one range.
>
> I'm not a big fan of over-engineering. Let's first understand if this
> case is needed.
As explained above with my use-case, having support for nr_ranges > 1 is not
just nice to have but absolutely necessary. Otherwise, this feature would be
constrained to creating dmabufs for contiguous buffers (nr_ranges = 1) only,
which would limit its usefulness, since GPU buffers are rarely contiguous in
device memory.
Thanks,
Vivek
>
> Thanks
>
> >
> > Thanks,
> > Vivek
> >
> > > + *
> > > + * Return: The fd number on success, -1 and errno is set on failure.
> > > + */
> > > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> > > +
> > > +struct vfio_device_feature_dma_buf {
> > > + __u32 region_index;
> > > + __u32 open_flags;
> > > + __u64 offset;
> > > + __u64 length;
> > > +};
> > > +
> > > /* -------- API for Type1 VFIO IOMMU -------- */
> > >
> > > /**
> > > --
> > > 2.50.1
> >