linux-kernel - Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP


Message-ID: <aFtHbXFO1ZpAsnV8@x1.local>
Date: Tue, 24 Jun 2025 20:48:45 -0400
From: Peter Xu <peterx@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	kvm@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
	Alex Williamson <alex.williamson@...hat.com>,
	Zi Yan <ziy@...dia.com>, Alex Mastro <amastro@...com>,
	David Hildenbrand <david@...hat.com>,
	Nico Pache <npache@...hat.com>
Subject: Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED
 mappings

On Tue, Jun 24, 2025 at 08:40:32PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 24, 2025 at 04:37:26PM -0400, Peter Xu wrote:
> > On Thu, Jun 19, 2025 at 03:40:41PM -0300, Jason Gunthorpe wrote:
> > > Even with this new version you have to decide to return PUD_SIZE or
> > > bar_size in pci and your same reasoning that PUD_SIZE make sense
> > > applies (though I would probably return bar_size and just let the core
> > > code cap it to PUD_SIZE)
> > 
> > Yes.
> > 
> > Today I went back to look at this, I was trying to introduce this for
> > file_operations:
> > 
> > 	int (*get_mapping_order)(struct file *, unsigned long, size_t);
> > 
> > It looks almost good, except that it so far has no way to return the
> > physical address for further calculation on the alignment.
> > 
> > For THP, VA is always calculated against pgoff not physical address on the
> > alignment.  I think it's OK for THP, because every 2M THP folio will be
> > naturally 2M aligned on the physical address, so it fits when e.g. pgoff=0
> > in the calculation of thp_get_unmapped_area_vmflags().
> > 
> > Logically it should even also work for vfio-pci, as long as VFIO keeps
> > using the lower 40 bits of the device_fd to represent the bar offset,
> > meanwhile it'll also require PCIe spec asking the PCI bars to be mapped
> > aligned with bar sizes.
> > 
> > But from an API POV, get_mapping_order() logically should return something
> > for further calculation of the alignment to get the VA.  pgoff here may not
> > always be the right thing to use to align to the VA: after all, pgtable
> > mapping is about VA -> PA, the only reasonable and reliable way is to align
> > VA to the PA to be mapped, and as an API we shouldn't assume pgoff is
> > always aligned to PA address space.
> 
> My feeling, and the reason I used the phrase "pgoff aligned address",
> is that the owner of the file should already ensure that for the large
> PTEs/folios:
>  pgoff % 2**order == 0
>  physical % 2**order == 0

IMHO there shouldn't really be any hard requirement in mm that pgoff and
the physical address be aligned with each other, but I confess I don't have
an example of a driver in the Linux tree that doesn't keep them aligned.
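
To make the concern concrete, here is a minimal user-space sketch (not
kernel code) of why picking the VA from pgoff alone only works when pgoff
and the physical address share the same alignment.  All helper names below
are hypothetical, and PAGE_SHIFT of 12 (4K base pages) is an assumption;
the logic is only loosely modelled on pgoff-based alignment and is not the
actual thp_get_unmapped_area_vmflags() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4K base pages */

/* Hypothetical helper: size in bytes of a mapping of the given page order. */
static uint64_t order_size(unsigned int order)
{
	assert(order + PAGE_SHIFT < 64);
	return (uint64_t)1 << (order + PAGE_SHIFT);
}

/*
 * Hypothetical helper: pick a VA at or above 'hint' that is congruent to
 * the file offset (pgoff, in pages) modulo the mapping size.  This is the
 * "pgoff aligned address" idea: only pgoff is consulted, never the PA.
 */
static uint64_t va_aligned_to_pgoff(uint64_t hint, uint64_t pgoff,
				    unsigned int order)
{
	uint64_t size = order_size(order);
	/* Offset of pgoff inside one naturally aligned block of 'size'. */
	uint64_t off = (pgoff & (((uint64_t)1 << order) - 1)) << PAGE_SHIFT;
	/* Round the hint up to the next 'size' boundary. */
	uint64_t base = (hint + size - 1) & ~(size - 1);

	return base + off;
}

/*
 * A single page table entry of this order can map va -> pa only when the
 * two addresses are congruent modulo the mapping size.
 */
static bool can_map_huge(uint64_t va, uint64_t pa, unsigned int order)
{
	return ((va ^ pa) & (order_size(order) - 1)) == 0;
}
```

With order 9 (2M) and pgoff=0, a hint of 0x7f0000123000 yields the VA
0x7f0000200000.  That VA can take a 2M entry for a PA of 0xe0000000, but
not for a PA of 0xe0100000, which is only 1M aligned even though pgoff
looks perfectly aligned.  That second case is the one an API based purely
on pgoff cannot detect.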

> 
> So, things like VFIO do need to hand out high alignment pgoffs to make
> this work - which it already does.
> 
> To me this just keeps things simpler. I guess if someone comes up with
> a case where they really can't get a pgoff alignment and really need a
> high order mapping then maybe we can add a new return field of some
> kind (pgoff adjustment?) but that is so weird I'd leave it to the
> future person to come and justify it.

Looking further, I also found some special-cased get_unmapped_area()
implementations that may not be trivially convertible to the new API, even
for CONFIG_MMU, namely:

- io_uring_get_unmapped_area
- arena_get_unmapped_area (from bpf_map->ops->map_get_unmapped_area)

I'll need to take a closer look tomorrow.  If any of them cannot be
converted to the new API with 100% safety, I'd also think we should not
introduce the new API, but reuse get_unmapped_area() until we know a way out.

-- 
Peter Xu