Message-ID: <aVv0ALacPukXIHTw@skinsburskii.localdomain>
Date: Mon, 5 Jan 2026 09:25:20 -0800
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>
Cc: "kys@...rosoft.com" <kys@...rosoft.com>,
"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
"wei.liu@...nel.org" <wei.liu@...nel.org>,
"decui@...rosoft.com" <decui@...rosoft.com>,
"longli@...rosoft.com" <longli@...rosoft.com>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mshv: Align huge page stride with guest mapping
On Sat, Jan 03, 2026 at 01:16:51AM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Friday, January 2, 2026 3:35 PM
> >
> > On Fri, Jan 02, 2026 at 09:13:31PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Friday, January 2, 2026 12:03 PM
> > > >
> > > > On Fri, Jan 02, 2026 at 06:04:56PM +0000, Michael Kelley wrote:
> > > > > From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Friday, January 2, 2026 9:43 AM
> > > > > >
> > > > > > On Tue, Dec 23, 2025 at 07:17:23PM +0000, Michael Kelley wrote:
> > > > > > > From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Tuesday, December 23, 2025 8:26 AM
> > > > > > > >
> > > > > > > > On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > > > > > > > > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > > > > > > > > >
> > > > > > > > > [snip]
> > > > > > > > > >
> > > > > > > > > > Separately, in looking at this, I spotted another potential problem with
> > > > > > > > > > 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> > > > > > > > > > not clear on. To create a new region, the user space VMM issues the
> > > > > > > > > > MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> > > > > > > > > > size, and the guest PFN. The only requirement on these values is that the
> > > > > > > > > > userspace address and size be page aligned. But suppose a 4 Meg region is
> > > > > > > > > > specified where the userspace address and the guest PFN have different
> > > > > > > > > > offsets modulo 2 Meg. The userspace address range gets populated first,
> > > > > > > > > > and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> > > > > > > > > > detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> > > > > > > > > > to create a 2 Meg mapping for the guest, the corresponding system PFN in
> > > > > > > > > > the page array may not be 2 Meg aligned. What does the hypervisor do in
> > > > > > > > > > this case? It can't create a 2 Meg mapping, right? So does it silently fall back
> > > > > > > > > > to creating 4K mappings, or does it return an error? Returning an error would
> > > > > > > > > > seem to be problematic for movable pages because the error wouldn't
> > > > > > > > > > occur until the guest VM is running and takes a range fault on the region.
> > > > > > > > > > Silently falling back to creating 4K mappings has performance implications,
> > > > > > > > > > though I guess it would work. My question is whether the
> > > > > > > > > > MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> > > > > > > > > > error immediately.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > In thinking about this more, I can answer my own question about the
> > > > > > > > > hypervisor behavior. When HVCALL_MAP_GPA_PAGES is invoked with
> > > > > > > > > HV_MAP_GPA_LARGE_PAGE set, the full list of 4K system PFNs is not
> > > > > > > > > provided as an input to the hypercall, so
> > > > > > > > > the hypervisor cannot silently fall back to 4K mappings. Assuming
> > > > > > > > > sequential PFNs would be wrong, so it must return an error if a
> > > > > > > > > system PFN isn't aligned to a 2 Meg boundary.
> > > > > > > > >
> > > > > > > > > For a pinned region, this error happens in mshv_region_map() as
> > > > > > > > > called from mshv_prepare_pinned_region(), so will propagate back
> > > > > > > > > to the ioctl. But the error happens only if pin_user_pages_fast()
> > > > > > > > > allocates one or more 2 Meg pages. So creating a pinned region
> > > > > > > > > where the guest PFN and userspace address have different offsets
> > > > > > > > > modulo 2 Meg might or might not succeed.
> > > > > > > > >
> > > > > > > > > For a movable region, the error probably can't occur.
> > > > > > > > > mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> > > > > > > > > around the faulting guest PFN. mshv_region_range_fault() then
> > > > > > > > > determines the corresponding userspace addr, which won't be on
> > > > > > > > > a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> > > > > > > > > page. With no 2 Meg pages, mshv_region_remap_pages() will
> > > > > > > > > always do 4K mappings and will succeed. The downside is that a
> > > > > > > > > movable region with a guest PFN and userspace address with
> > > > > > > > > different offsets never gets any 2 Meg pages or mappings.
> > > > > > > > >
> > > > > > > > > My conclusion is the same -- such misalignment should not be
> > > > > > > > > allowed when creating a region that has the potential to use 2 Meg
> > > > > > > > > pages. Regions less than 2 Meg in size could be excluded from such
> > > > > > > > > a requirement if there is benefit in doing so. It's possible to have
> > > > > > > > > regions up to (but not including) 4 Meg where the alignment prevents
> > > > > > > > > having a 2 Meg page, and those could also be excluded from the
> > > > > > > > > requirement.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm not sure I understand the problem.
> > > > > > > > There are three cases to consider:
> > > > > > > > 1. Guest mapping, where page sizes are controlled by the guest.
> > > > > > > > 2. Host mapping, where page sizes are controlled by the host.
> > > > > > >
> > > > > > > And by "host", you mean specifically the Linux instance running in the
> > > > > > > root partition. It hosts the VMM processes and creates the memory
> > > > > > > regions for each guest.
> > > > > > >
> > > > > > > > 3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
> > > > > > > >
> > > > > > > > The first case is not relevant here and is included for completeness.
> > > > > > >
> > > > > > > Agreed.
> > > > > > >
> > > > > > > >
> > > > > > > > The second and third cases (host and hypervisor) share the memory layout,
> > > > > > >
> > > > > > > Right. More specifically, they are both operating on the same set of physical
> > > > > > > memory pages, and hence "share" a set of what I've referred to as
> > > > > > > "system PFNs" (to distinguish from guest PFNs, or GFNs).
> > > > > > >
> > > > > > > > but it is up
> > > > > > > > to each entity to decide which page sizes to use. For example, the host might map the
> > > > > > > > proposed 4M region with only 4K pages, even if a 2M page is available in the middle.
> > > > > > >
> > > > > > > Agreed.
> > > > > > >
> > > > > > > > In this case, the host will map the memory as represented by 4K pages, but the hypervisor
> > > > > > > > can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
> > > > > > >
> > > > > > > Yes, that's possible, but subject to significant requirements. A 2M page can be
> > > > > > > used only if the underlying physical memory is a physically contiguous 2M chunk.
> > > > > > > Furthermore, that contiguous 2M chunk must start on a physical 2M boundary,
> > > > > > > and the virtual address to which it is being mapped must be on a 2M boundary.
> > > > > > > In the case of the host, that virtual address is the user space address in the
> > > > > > > user space process. In the case of the hypervisor, that "virtual address" is the
> > > > > > > location in guest physical address space; i.e., the guest PFN left-shifted 12
> > > > > > > to be a guest physical address.
> > > > > > >
> > > > > > > These requirements are from the physical processor and its requirements on
> > > > > > > page table formats as specified by the hardware architecture. Whereas the
> > > > > > > page table entry for a 4K page contains the entire PFN, the page table entry
> > > > > > > for a 2M page omits the low order 9 bits of the PFN -- those bits must be zero,
> > > > > > > which is equivalent to requiring that the PFN be on a 2M boundary. These
> > > > > > > requirements apply to both host and hypervisor mappings.
> > > > > > >
> > > > > > > When MSHV code in the host creates a new pinned region via the ioctl,
> > > > > > > MSHV code first allocates memory for the region using pin_user_pages_fast(),
> > > > > > > which returns the system PFN for each page of physical memory that is
> > > > > > > allocated. If the host, at its discretion, allocates a 2M page, then a series
> > > > > > > of 512 sequential 4K PFNs is returned for that 2M page, and the first of
> > > > > > > the 512 sequential PFNs must have its low order 9 bits be zero.
> > > > > > >
> > > > > > > Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for
> > > > > > > the hypervisor to map the allocated memory into the guest physical
> > > > > > > address space at a particular guest PFN. If the allocated memory contains
> > > > > > > a 2M page, mshv_chunk_stride() will see a folio order of 9 for the 2M page,
> > > > > > > causing the HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that
> > > > > > > the hypervisor do that mapping as a 2M large page. The hypercall does not
> > > > > > > have the option of dropping back to 4K page mappings in this case. If
> > > > > > > the 2M alignment of the system PFN is different from the 2M alignment
> > > > > > > of the target guest PFN, it's not possible to create the mapping and the
> > > > > > > hypercall fails.
> > > > > > >
> > > > > > > The core problem is that the same 2M of physical memory wants to be
> > > > > > > mapped by the host as a 2M page and by the hypervisor as a 2M page.
> > > > > > > That can't be done unless the host alignment (in the VMM virtual address
> > > > > > > space) and the guest physical address (i.e., the target guest PFN) alignment
> > > > > > > match and are both on 2M boundaries.
> > > > > > >
> > > > > >
> > > > > > But why is it a problem? If both the host and the hypervisor can map a
> > > > > > huge page, but the guest can't, it's still a win, no?
> > > > > > In other words, if VMM passes a host huge page aligned region as a guest
> > > > > > unaligned, it's a VMM problem, not a hypervisor problem. And I don't
> > > > > > understand why we would want to prevent such cases.
> > > > > >
> > > > >
> > > > > Fair enough -- mostly. If you want to allow the misaligned case and live
> > > > > with not getting the 2M mapping in the guest, that works except in the
> > > > > situation that I described above, where the HVCALL_MAP_GPA_PAGES
> > > > > hypercall fails when creating a pinned region.
> > > > >
> > > > > The failure is flakey in that if the Linux in the root partition does not
> > > > > map any of the region as a 2M page, the hypercall succeeds and the
> > > > > MSHV_GET_GUEST_MEMORY ioctl succeeds. But if the root partition
> > > > > happens to map any of the region as a 2M page, the hypercall will fail,
> > > > > and the MSHV_GET_GUEST_MEMORY ioctl will fail. Presumably such
> > > > > flakey behavior is bad for the VMM.
> > > > >
> > > > > One solution is that mshv_chunk_stride() must return a stride > 1 only
> > > > > if both the gfn (which it currently checks) AND the corresponding
> > > > > userspace_addr are 2M aligned. Then the HVCALL_MAP_GPA_PAGES
> > > > > hypercall will never have HV_MAP_GPA_LARGE_PAGE set for the
> > > > > misaligned case, and the failure won't occur.
> > > > >
> > > >
> > > > I think I see your point, but I also think this issue doesn't exist,
> > > > because mshv_chunk_stride() returns a huge page stride only if:
> > > > 1. the folio order is PMD_ORDER and
> > > > 2. GFN is huge page aligned and
> > > > 3. the number of 4K pages is huge page aligned.
> > > >
> > > > In other words, a host huge page won't be mapped as huge if the page
> > > > can't be mapped as huge in the guest.
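> > > > Roughly, in code (paraphrasing the check as I described it, not
> > > > quoting the exact source):
> > > >
> > > > 	if (page_order &&                              /* 1: huge folio */
> > > > 	    IS_ALIGNED(gfn, PTRS_PER_PMD) &&           /* 2: GFN aligned */
> > > > 	    IS_ALIGNED(page_count, PTRS_PER_PMD))      /* 3: size aligned */
> > > > 		return 1 << page_order;
> > > > 	return 1;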
> > >
> > > OK, I'm not seeing how what you say is true. For pinned regions,
> > > the memory is allocated and mapped into the host userspace address
> > > first, as done by mshv_prepare_pinned_region() calling mshv_region_pin(),
> > > which calls pin_user_pages_fast(). This is all done without considering
> > > the GFN or GFN alignment. So one or more 2M pages might be allocated
> > > and mapped in the host before any guest mapping is looked at. Agreed?
> > >
> >
> > Agreed.
> >
> > > Then mshv_prepare_pinned_region() calls mshv_region_map() to do the
> > > guest mapping. This eventually gets down to mshv_chunk_stride(). In
> > > mshv_chunk_stride() when your conditions #2 and #3 are met, the
> > > corresponding struct page argument to mshv_chunk_stride() may be a
> > > 4K page that is in the middle of a 2M page instead of at the beginning
> > > (if the region is mis-aligned). But the key point is that the 4K page in
> > > the middle is part of a folio that will return a folio order of PMD_ORDER.
> > > I.e., a folio order of PMD_ORDER is not sufficient to ensure that the
> > > struct page arg is at the *start* of a 2M-aligned physical memory range
> > > that can be mapped into the guest as a 2M page.
> > >
> >
> > I'm trying to understand how this can even happen, so please bear with
> > me.
> > In other words (and AFAIU), what you are saying is the following:
> >
> > 1. VMM creates a mapping with huge page(s) (this implies that the virtual
> > address is huge page aligned, the size is huge page aligned, and the physical
> > pages are consecutive).
> > 2. VMM tries to create a region via ioctl, but instead of passing the
> > start of the region, it passes an offset into one of the region's
> > huge pages, while at the same time the base GFN and the size are huge
> > page aligned (to meet the #2 and #3 conditions).
> > 3. mshv_chunk_stride() sees a folio order of PMD_ORDER, and tries to map
> > the corresponding pages as huge, which will be rejected by the
> > hypervisor.
> >
> > Is this accurate?
>
> Yes, pretty much. In Step 1, the VMM may just allocate some virtual
> address space, and not do anything to populate it with physical pages.
> So populating with any 2M pages may not happen until Step 2 when
> the ioctl calls pin_user_pages_fast(). But *when* the virtual address
> space gets populated with physical pages doesn't really matter. We
> just know that it happens before the ioctl tries to map the memory
> into the guest -- i.e., mshv_prepare_pinned_region() calls
> mshv_region_map().
>
> And yes, the problem is what you call out in Step 2: as input to the
> ioctl, the fields "userspace_addr" and "guest_pfn" in struct
> mshv_user_mem_region could have different alignments modulo 2M
> boundaries. When they are different, that's what I'm calling a "mis-aligned
> region", (referring to a struct mshv_mem_region that is created and
> setup by the ioctl).
>
> > A subsequent question: if it is accurate, why does the driver need to
> > support this case? It looks like a VMM bug to me.
>
> I don't know if the driver needs to support this case. That's a question
> for the VMM people to answer. I wouldn't necessarily assume that the
> VMM always allocates virtual address space with exactly the size and
> alignment that matches the regions it creates with the ioctl. The
> kernel ioctl doesn't care how the VMM allocates and manages its
> virtual address space, so the VMM is free to do whatever it wants
> in that regard, as long as it meets the requirements of the ioctl. So
> the requirements of the ioctl in this case are something to be
> negotiated with the VMM.
>
> > Also, how should it support it? By rejecting such requests in the ioctl?
>
> Rejecting requests to create a mis-aligned region is certainly one option
> if the VMM agrees that's OK. The ioctl currently requires only that
> "userspace_addr" and "size" be page aligned, so those requirements
> could be tightened.
>
> The other approach is to fix mshv_chunk_stride() to handle the
> mis-aligned case. Doing so is even easier than I first envisioned.
> I think this works:
>
> @@ -49,7 +49,8 @@ static int mshv_chunk_stride(struct page *page,
> */
> if (page_order &&
> IS_ALIGNED(gfn, PTRS_PER_PMD) &&
> - IS_ALIGNED(page_count, PTRS_PER_PMD))
> + IS_ALIGNED(page_count, PTRS_PER_PMD) &&
> + IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD))
> return 1 << page_order;
>
> return 1;
>
> But as we discussed earlier, this fix means never getting 2M mappings
> in the guest for a region that is mis-aligned.
>
Although I understand the logic behind this fix, I'm hesitant to add it
because it looks like a workaround for a VMM bug that could bite back.
The approach you propose will silently map a huge page as a collection
of 4K pages, hurting guest performance (this will be especially
visible for a region containing a single huge page).

This fix silently allows such behavior instead of reporting it as an
error to user space. It's worth noting that pinned-region population and
mapping happen upon ioctl invocation, so the VMM will either get an
error from the hypervisor (current behavior) or get a region mapped with
4K pages (proposed behavior).

The first case is an explicit error. The second, although it allows
adding a region, will be less performant, will significantly increase
region mapping time (and thus potentially guest spin-up/creation time),
and will be less noticeable to customers, especially those who don't
really understand what's happening under the hood and have simply
stumbled upon some VMM bug.
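
Just to illustrate the alternative I have in mind, here is a rough,
untested sketch of rejecting such regions at ioctl time. The helper name
and the surrounding call site are made up; only userspace_addr,
guest_pfn, PAGE_SHIFT and PTRS_PER_PMD are existing names:

	/*
	 * Illustration only: reject a region whose userspace address and
	 * guest PFN have different offsets within a 2M (PMD-sized) chunk,
	 * so the hypervisor is never asked for a large-page mapping that
	 * the system PFN alignment cannot satisfy.
	 */
	static bool mshv_region_hugepage_compatible(u64 userspace_addr,
						    u64 guest_pfn)
	{
		u64 uaddr_pfn = userspace_addr >> PAGE_SHIFT;

		return (uaddr_pfn & (PTRS_PER_PMD - 1)) ==
		       (guest_pfn & (PTRS_PER_PMD - 1));
	}

	/* then, early in the MSHV_GET_GUEST_MEMORY ioctl path: */
	if (!mshv_region_hugepage_compatible(mem.userspace_addr, mem.guest_pfn))
		return -EINVAL;

That way the VMM gets -EINVAL immediately instead of a region that is
silently degraded to 4K mappings.
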
What’s your take?
Thanks,
Stanislav
> Michael
>
> >
> > Thanks,
> > Stanislav
> >
> > > The problem does *not* happen with a movable region, but the reasoning
> > > is different. hmm_range_fault() is always called with a 2M range aligned
> > > to the GFN, which in a mis-aligned region means that the host userspace
> > > address is never 2M aligned. So hmm_range_fault() is never able to allocate
> > > and map a 2M page. mshv_chunk_stride() will never get a folio order > 1,
> > > and the hypercall is never asked to do a 2M mapping. Both host and guest
> > > mappings will always be 4K and everything works.
> > >
> > > Michael
> > >
> > > > And this function is called for
> > > > both movable and pinned regions, so the hypercall should never fail due to
> > > > a huge page alignment issue.
> > > >
> > > > What do I miss here?
> > > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > >
> > > > > Michael
> > > > >
> > > > > >
> > > > > > > Movable regions behave a bit differently because the memory for the
> > > > > > > region is not allocated on the host "up front" when the region is created.
> > > > > > > The memory is faulted in as the guest runs, and the vagaries of the current
> > > > > > > MSHV in Linux code are such that 2M pages are never created on the host
> > > > > > > if the alignments don't match. HV_MAP_GPA_LARGE_PAGE is never passed
> > > > > > > to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K
> > > > > > > mappings, which works even with the misalignment.
> > > > > > >
> > > > > > > >
> > > > > > > > This adjustment happens at runtime. Could this be the missing detail here?
> > > > > > >
> > > > > > > Adjustments at runtime are a different topic from the issue I'm raising,
> > > > > > > though eventually there's some relationship. My issue occurs in the
> > > > > > > creation of a new region, and the setting up of the initial hypervisor
> > > > > > > mapping. I haven't thought through the details of adjustments at runtime.
> > > > > > >
> > > > > > > My usual caveats apply -- this is all "thought experiment". If I had the
> > > > > > > means to do some runtime testing to confirm, I would. It's possible the
> > > > > > > hypervisor is playing some trick I haven't envisioned, but I'm skeptical of
> > > > > > > that given the basics of how physical processors work with page tables.
> > > > > > >
> > > > > > > Michael