Message-ID: <aEx6Qyl3cgiarXZD@x1.local>
Date: Fri, 13 Jun 2025 15:21:39 -0400
From: Peter Xu <peterx@...hat.com>
To: David Hildenbrand <david@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, kvm@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Alex Williamson <alex.williamson@...hat.com>,
Zi Yan <ziy@...dia.com>, Jason Gunthorpe <jgg@...dia.com>,
Alex Mastro <amastro@...com>, Nico Pache <npache@...hat.com>
Subject: Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED
mappings
On Fri, Jun 13, 2025 at 08:09:41PM +0200, David Hildenbrand wrote:
> On 13.06.25 15:41, Peter Xu wrote:
> > This patch enables best-effort mmap() for vfio-pci bars even without
> > MAP_FIXED, so as to utilize huge pfnmaps as much as possible. It also
> > avoids requiring userspace changes (switching to MAP_FIXED with
> > pre-aligned VA addresses) just to start enabling huge pfnmaps on VFIO
> > bars.
> >
> > Here the trick is making sure the MMIO PFNs will be aligned with the VAs
> > allocated from mmap() when !MAP_FIXED, so that whatever is returned from
> > mmap(!MAP_FIXED) of vfio-pci MMIO regions will automatically be suitable
> > for huge pfnmaps as much as possible.
> >
> > To achieve that, a custom vfio_device's get_unmapped_area() for vfio-pci
> > devices is needed.
> >
> > Note that MMIO physical addresses should normally be guaranteed to be
> > bar-size aligned, hence the bar offset could logically be used directly
> > for the calculation. However, to make it strict and clear (rather than
> > relying on spec details), we still fetch the bar's physical address from
> > pci_dev.resource[].
> >
> > Signed-off-by: Alex Williamson <alex.williamson@...hat.com>
>
> There is likely a
>
> Co-developed-by: Alex Williamson <alex.williamson@...hat.com>
>
> missing?
Would it mean the same if we use the two SoBs as this patch does?
I honestly don't know the difference.. I hope it's fine to show that this
patch was developed together. Please let me know otherwise.
>
> > Signed-off-by: Peter Xu <peterx@...hat.com>
> > ---
> > drivers/vfio/pci/vfio_pci.c | 3 ++
> > drivers/vfio/pci/vfio_pci_core.c | 65 ++++++++++++++++++++++++++++++++
> > include/linux/vfio_pci_core.h | 6 +++
> > 3 files changed, 74 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 5ba39f7623bb..d9ae6cdbea28 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -144,6 +144,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
> > .detach_ioas = vfio_iommufd_physical_detach_ioas,
> > .pasid_attach_ioas = vfio_iommufd_physical_pasid_attach_ioas,
> > .pasid_detach_ioas = vfio_iommufd_physical_pasid_detach_ioas,
> > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > + .get_unmapped_area = vfio_pci_core_get_unmapped_area,
> > +#endif
> > };
> > static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 6328c3a05bcd..835bc168f8b7 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1641,6 +1641,71 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
> > return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
> > }
> > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > +/*
> > + * Hint function to provide an mmap() virtual address candidate so as to
> > + * be able to map huge pfnmaps as much as possible. This is done by
> > + * aligning the VA to the PFN to be mapped in the specific bar.
> > + *
> > + * Note that this function only does the minimum checks on mmap()
> > + * parameters needed to make the PFN calculation valid. The majority of
> > + * the mmap() sanity checks will be done later in mmap().
> > + */
> > +unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
> > + struct file *file,
> > + unsigned long addr,
> > + unsigned long len,
> > + unsigned long pgoff,
> > + unsigned long flags)
>
> A very suboptimal way to indent this many parameters; just use two tabs at
> the beginning.
This is the default indentation from Emacs c-mode.
Since this is a VFIO file, I checked it and it looks like there isn't yet
a strict indentation rule across the whole file. I can switch to two tabs
for sure if nobody else disagrees.
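If I read the suggestion right, that would be something like below (two
tabs for the continuation lines, same prototype as in the patch):

unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
		struct file *file, unsigned long addr, unsigned long len,
		unsigned long pgoff, unsigned long flags)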
>
> > +{
> > + struct vfio_pci_core_device *vdev =
> > + container_of(device, struct vfio_pci_core_device, vdev);
> > + struct pci_dev *pdev = vdev->pdev;
> > + unsigned long ret, phys_len, req_start, phys_addr;
> > + unsigned int index;
> > +
> > + index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>
> Could do
>
> unsigned int index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>
> at the very top.
Sure.
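I.e. (I assume) declaring it with the initializer directly, and dropping
the separate assignment below:

	unsigned int index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);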
>
> > +
> > + /* Currently, only bars 0-5 support huge pfnmaps */
> > + if (index >= VFIO_PCI_ROM_REGION_INDEX)
> > + goto fallback;
> > +
> > + /* Bar offset */
> > + req_start = (pgoff << PAGE_SHIFT) & ((1UL << VFIO_PCI_OFFSET_SHIFT) - 1);
> > + phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
> > +
> > + /*
> > + * Make sure we can at least get a valid physical address to do the
> > + * math. If we can't, mmap() will probably fail later anyway..
> > + */
> > + if (req_start >= phys_len)
> > + goto fallback;
> > +
> > + phys_len = MIN(phys_len, len);
> > + /* Calculate the start of physical address to be mapped */
> > + phys_addr = pci_resource_start(pdev, index) + req_start;
> > +
> > + /* Choose the alignment */
> > + if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> > + ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > + flags, PUD_SIZE, 0);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + if (phys_len >= PMD_SIZE) {
> > + ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > + flags, PMD_SIZE, 0);
> > + if (ret)
> > + return ret;
>
> Similar to Jason, I wonder if that logic should reside in the core, and we
> only indicate the maximum page table level we support.
I replied. We can continue the discussion there.
Thanks,
--
Peter Xu