[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251216144224.GE6079@nvidia.com>
Date: Tue, 16 Dec 2025 10:42:24 -0400
From: Jason Gunthorpe <jgg@...dia.com>
To: Peter Xu <peterx@...hat.com>
Cc: kvm@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Nico Pache <npache@...hat.com>, Zi Yan <ziy@...dia.com>,
Alex Mastro <amastro@...com>, David Hildenbrand <david@...hat.com>,
Alex Williamson <alex@...zbot.org>, Zhi Wang <zhiw@...dia.com>,
David Laight <david.laight.linux@...il.com>,
Yi Liu <yi.l.liu@...el.com>, Ankit Agrawal <ankita@...dia.com>,
Kevin Tian <kevin.tian@...el.com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2 4/4] vfio-pci: Best-effort huge pfnmaps with
!MAP_FIXED mappings
On Wed, Dec 10, 2025 at 03:43:43PM -0500, Peter Xu wrote:
> > This seems a bit weird, the vma length is already known, it is len,
> > why do we go to all this trouble to recalculate len in terms of phys?
> >
> > If the length is wrong the mmap will fail, so there is no issue with
> > returning a larger order here.
> >
> > I feel this should just return the order based on pci_resource_len()?
>
> IIUC there's a trivial difference when partial of a huge bar is mapped.
>
> Example: 1G bar, map range (pgoff=2M, size=1G-2M).
>
> If we return bar size order, we'd say 1G, however then it means we'll do
> the alignment with 1G. __thp_get_unmapped_area() will think it's not
> proper, because:
>
> loff_t off_end = off + len;
> loff_t off_align = round_up(off, size);
>
> if (off_end <= off_align || (off_end - off_align) < size)
> return 0;
>
> Here what we really want is to map (2M, 1G-2M) with 2M huge, not 1G, nor
> 4K.
This was the point of my prior email, the alignment calculation can't
just be 'align to a size'. The core code needs to choose a VA such
that:
VA % (1 << order) == pg_off % (1 << order)
So if VFIO returns 1G then the VA should be aligned to 2M within a 1G
region. This allows opportunities to increase the alignment as the
mapping continues, eg if the arch supports a 32M 16x2M contiguous page
then we'd get 32M mappings with your example.
The core code should adjust the target order from the driver by:
lowest of order or size rounded down to a power of two
then
highest arch supported leaf size below the above
None of this logic should be in drivers.
The way to think about this is that the driver is returning an order
which indicates the maximum case where:
VA % (1 << order) == pg_off % (1 << order)
Could be true. Eg a PCI BAR returns an order that is the size of the
BAR because it is always true. Something that stores at most 1G pages
would return 1G pages, etc.
> Note that here checking CONFIG_ARCH_SUPPORTS_P*D_PFNMAP is a vfio behavior,
> pairing with the huge_fault() of vfio-pci driver. It implies if vfio-pci's
> huge pfnmap is enabled or not. If it's not enabled, we don't need to
> report larger orders here.
Honestly, I'd ignore this, and I'm not sure VFIO should be testing
those in the huge_fault either. Again the core code should deal with
it.
> Shall I keep it simple to leave it to drivers, until we have something more
> solid (I think we need HAVE_ARCH_HUGE_P*D_LEAVES here)?
Logic like this should not be in drivers.
> Even with that config ready, drivers should always still do proper check on
> its own (drivers need to support huge pfnmaps here first before reporting
> high orders).
Drivers shouldn't implement this alignment function without also
implementing huge fault, it is pointless. Don't see a reason to add
extra complexity.
Jason
Powered by blists - more mailing lists