[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aUGCPN2ngvWMG2Ta@x1.local>
Date: Tue, 16 Dec 2025 11:01:00 -0500
From: Peter Xu <peterx@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: kvm@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Nico Pache <npache@...hat.com>, Zi Yan <ziy@...dia.com>,
Alex Mastro <amastro@...com>, David Hildenbrand <david@...hat.com>,
Alex Williamson <alex@...zbot.org>, Zhi Wang <zhiw@...dia.com>,
David Laight <david.laight.linux@...il.com>,
Yi Liu <yi.l.liu@...el.com>, Ankit Agrawal <ankita@...dia.com>,
Kevin Tian <kevin.tian@...el.com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2 4/4] vfio-pci: Best-effort huge pfnmaps with
!MAP_FIXED mappings
On Tue, Dec 16, 2025 at 10:42:24AM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 10, 2025 at 03:43:43PM -0500, Peter Xu wrote:
> > > This seems a bit weird, the vma length is already known, it is len,
> > > why do we go to all this trouble to recalculate len in terms of phys?
> > >
> > > If the length is wrong the mmap will fail, so there is no issue with
> > > returning a larger order here.
> > >
> > > I feel this should just return the order based on pci_resource_len()?
> >
> > IIUC there's a trivial difference when partial of a huge bar is mapped.
> >
> > Example: 1G bar, map range (pgoff=2M, size=1G-2M).
> >
> > If we return bar size order, we'd say 1G, however then it means we'll do
> > the alignment with 1G. __thp_get_unmapped_area() will think it's not
> > proper, because:
> >
> > loff_t off_end = off + len;
> > loff_t off_align = round_up(off, size);
> >
> > if (off_end <= off_align || (off_end - off_align) < size)
> > return 0;
> >
> > Here what we really want is to map (2M, 1G-2M) with 2M huge, not 1G, nor
> > 4K.
>
> This was the point of my prior email, the alignment calculation can't
> just be 'align to a size'. The core code needs to choose a VA such
> that:
IMHO these are two different things we're discussing here. I've replied in
the other email about pgoff alignments, I think it's properly done, let's
keep the discussion there. I'll reply to the other issue raised.
>
> VA % (1 << order) == pg_off % (1 << order)
>
> So if VFIO returns 1G then the VA should be aligned to 2M within a 1G
> region. This allows opportunities to increase the alignment as the
> mapping continues, eg if the arch supports a 32M 16x2M contiguous page
> then we'd get 32M mappings with your example.
>
> The core code should adjust the target order from the driver by:
> lowest of order or size rounded down to a power of two
> then
> highest arch supported leaf size below the above
Yes, maybe this would be better.
E.g. I would expect if a driver has 32M returned (order=13), then on x86_64
it should be adjusted to 2M (order=9), but on a 4K pgsize arm64 it should
be kept as 32M (order=13) as it matches contpmds.
Do we have any function that we can fetch the best mapping lower than a
specific order?
>
> None of this logic should be in drivers.
I still think it's the driver's decision to have its own macro controlling
the huge pfnmap behavior. I agree with you core mm can have it, I don't
see it blocks the driver not returning huge order if huge pfnmap is turned
off. VFIO-PCI currently indeed only depends directly on global THP
configs, but I don't see why it's strictly needed. So I think it's fine if
a driver (even if global THP enabled for pmd/pud) deselect huge pfnmap for
other reasons, then here the order returned can still always be PSIZE for
the driver. It's really not a huge deal to me.
>
> The way to think about this is that the driver is returning an order
> which indicates the maximum case where:
>
> VA % (1 << order) == pg_off % (1 << order)
>
> Could be true. Eg a PCI BAR returns an order that is the size of the
> BAR because it is always true. Something that stores at most 1G pages
> would return 1G pages, etc.
>
> > Note that here checking CONFIG_ARCH_SUPPORTS_P*D_PFNMAP is a vfio behavior,
> > pairing with the huge_fault() of vfio-pci driver. It implies if vfio-pci's
> > huge pfnmap is enabled or not. If it's not enabled, we don't need to
> > report larger orders here.
>
> Honestly, I'd ignore this, and I'm not sure VFIO should be testing
> those in the huge_fault either. Again the core code should deal with
> it.
>
> > Shall I keep it simple to leave it to drivers, until we have something more
> > solid (I think we need HAVE_ARCH_HUGE_P*D_LEAVES here)?
>
> Logic like this should not be in drivers.
>
> > Even with that config ready, drivers should always still do proper check on
> > its own (drivers need to support huge pfnmaps here first before reporting
> > high orders).
>
> Drivers shouldn't implement this alignment function without also
> implementing huge fault, it is pointless. Don't see a reason to add
> extra complexity.
It's not implementing the order hint without huge fault. It's when both
are turned off in a kernel config.. then the order hint (even from driver
POV) shouldn't need to be reported.
I don't know why you have so strong feeling on having a config check in
vfio-pci drivers is bad. I still think it's good to have it pairing with
the same macro in huge_fault(), because it's essentially part of the whole
pfnmap feature so it's fair they're guarded by the same kernel config, but
I'm ok either way in case of current case of vfio-pci where it 100% depends
on global THP setups.
--
Peter Xu
Powered by blists - more mailing lists