Message-ID:
<SN6PR02MB41573BF52C6A4447C720CDD6D4B5A@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Tue, 23 Dec 2025 19:17:23 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
CC: "kys@...rosoft.com" <kys@...rosoft.com>, "haiyangz@...rosoft.com"
<haiyangz@...rosoft.com>, "wei.liu@...nel.org" <wei.liu@...nel.org>,
"decui@...rosoft.com" <decui@...rosoft.com>, "longli@...rosoft.com"
<longli@...rosoft.com>, "linux-hyperv@...r.kernel.org"
<linux-hyperv@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] mshv: Align huge page stride with guest mapping
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Tuesday, December 23, 2025 8:26 AM
>
> On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > >
> > [snip]
> > >
> > > Separately, in looking at this, I spotted another potential problem with
> > > 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> > > not clear on. To create a new region, the user space VMM issues the
> > > MSHV_SET_GUEST_MEMORY ioctl, specifying the userspace address, the
> > > size, and the guest PFN. The only requirement on these values is that the
> > > userspace address and size be page aligned. But suppose a 4 Meg region is
> > > specified where the userspace address and the guest PFN have different
> > > offsets modulo 2 Meg. The userspace address range gets populated first,
> > > and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> > > detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> > > to create a 2 Meg mapping for the guest, the corresponding system PFN in
> > > the page array may not be 2 Meg aligned. What does the hypervisor do in
> > > this case? It can't create a 2 Meg mapping, right? So does it silently fall back
> > > to creating 4K mappings, or does it return an error? Returning an error would
> > > seem to be problematic for movable pages because the error wouldn't
> > > occur until the guest VM is running and takes a range fault on the region.
> > > Silently falling back to creating 4K mappings has performance implications,
> > > though I guess it would work. My question is whether the
> > > MSHV_SET_GUEST_MEMORY ioctl should detect this case and return an
> > > error immediately.
> > >
> >
> > In thinking about this more, I can answer my own question about the
> > hypervisor behavior. When HV_MAP_GPA_LARGE_PAGE is set, the full
> > list of 4K system PFNs is not provided as an input to the hypercall, so
> > the hypervisor cannot silently fall back to 4K mappings. Assuming
> > sequential PFNs would be wrong, so it must return an error if a
> > system PFN isn't aligned on a 2 Meg boundary.
> >
> > For a pinned region, this error happens in mshv_region_map() as
> > called from mshv_prepare_pinned_region(), so will propagate back
> > to the ioctl. But the error happens only if pin_user_pages_fast()
> > allocates one or more 2 Meg pages. So creating a pinned region
> > where the guest PFN and userspace address have different offsets
> > modulo 2 Meg might or might not succeed.
> >
> > For a movable region, the error probably can't occur.
> > mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> > around the faulting guest PFN. mshv_region_range_fault() then
> > determines the corresponding userspace addr, which won't be on
> > a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> > page. With no 2 Meg pages, mshv_region_remap_pages() will
> > always do 4K mappings and will succeed. The downside is that a
> > movable region with a guest PFN and userspace address with
> > different offsets never gets any 2 Meg pages or mappings.
> >
> > My conclusion is the same -- such misalignment should not be
> > allowed when creating a region that has the potential to use 2 Meg
> > pages. Regions less than 2 Meg in size could be excluded from such
> > a requirement if there is benefit in doing so. It's possible to have
> > regions up to (but not including) 4 Meg where the alignment prevents
> > having a 2 Meg page, and those could also be excluded from the
> > requirement.
> >
>
> I'm not sure I understand the problem.
> There are three cases to consider:
> 1. Guest mapping, where page sizes are controlled by the guest.
> 2. Host mapping, where page sizes are controlled by the host.
And by "host", you mean specifically the Linux instance running in the
root partition. It hosts the VMM processes and creates the memory
regions for each guest.
> 3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
>
> The first case is not relevant here and is included for completeness.
Agreed.
>
> The second and third cases (host and hypervisor) share the memory layout,
Right. More specifically, they are both operating on the same set of physical
memory pages, and hence "share" a set of what I've referred to as
"system PFNs" (to distinguish from guest PFNs, or GFNs).
> but it is up
> to each entity to decide which page sizes to use. For example, the host might map the
> proposed 4M region with only 4K pages, even if a 2M page is available in the middle.
Agreed.
> In this case, the host will map the memory as represented by 4K pages, but the hypervisor
> can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
Yes, that's possible, but subject to significant requirements. A 2M page can be
used only if the underlying physical memory is a physically contiguous 2M chunk.
Furthermore, that contiguous 2M chunk must start on a physical 2M boundary,
and the virtual address to which it is being mapped must be on a 2M boundary.
In the case of the host, that virtual address is the user space address in the
user space process. In the case of the hypervisor, that "virtual address" is
the location in guest physical address space; i.e., the guest PFN left-shifted
by 12 bits to form a guest physical address.
These requirements come from the physical processor and the page table
formats specified by the hardware architecture. Whereas the
page table entry for a 4K page contains the entire PFN, the page table entry
for a 2M page omits the low order 9 bits of the PFN -- those bits must be zero,
which is equivalent to requiring that the PFN be on a 2M boundary. These
requirements apply to both host and hypervisor mappings.
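To make the arithmetic concrete, here is a minimal sketch (plain C, not mshv
code; the helper names are mine) of those two hardware constraints: the PFN
placed in a 2M page table entry must have its low order 9 bits clear, and a
2M mapping can only start at a 2M-aligned virtual (or guest physical) address:

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PMD_ORDER	9				/* 512 x 4K pages per 2M page */
#define SZ_2M		(1ULL << (PAGE_SHIFT + PMD_ORDER))

/* The PFN in a 2M page table entry must have its low order 9 bits clear. */
static bool pfn_ok_for_2m_entry(uint64_t pfn)
{
	return (pfn & ((1ULL << PMD_ORDER) - 1)) == 0;
}

/* A 2M mapping can only start at a 2M-aligned virtual or guest physical address. */
static bool addr_ok_for_2m_entry(uint64_t addr)
{
	return (addr & (SZ_2M - 1)) == 0;
}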
When MSHV code in the host creates a new pinned region via the ioctl,
it first allocates memory for the region using pin_user_pages_fast(),
which returns the system PFN for each page of physical memory that is
allocated. If the host, at its discretion, allocates a 2M page, then a series
of 512 sequential 4K PFNs is returned for that 2M page, and the first of
the 512 sequential PFNs will have its low order 9 bits equal to zero.
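For illustration, a rough sketch of that pinning step (my own, not the
driver's code; the function name is made up). pin_user_pages_fast() fills a
page array with one struct page per 4K page of the userspace range, and a 2M
folio allocated by the host appears as 512 consecutive entries with
sequential PFNs:

#include <linux/errno.h>
#include <linux/mm.h>

/* Sketch only: pin every 4K page of the userspace range backing a region. */
static int pin_region_pages(unsigned long uaddr, int nr_pages,
			    struct page **pages)
{
	int pinned = pin_user_pages_fast(uaddr, nr_pages,
					 FOLL_WRITE | FOLL_LONGTERM, pages);

	if (pinned < 0)
		return pinned;
	if (pinned != nr_pages) {
		unpin_user_pages(pages, pinned);
		return -EFAULT;
	}
	return 0;
}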
Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for
the hypervisor to map the allocated memory into the guest physical
address space at a particular guest PFN. If the allocated memory contains
a 2M page, mshv_chunk_stride() will see a folio order of 9 for the 2M page,
causing the HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that
the hypervisor do that mapping as a 2M large page. The hypercall does not
have the option of dropping back to 4K page mappings in this case. If
the 2M alignment of the system PFN is different from the 2M alignment
of the target guest PFN, it's not possible to create the mapping and the
hypercall fails.
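A simplified sketch of the kind of stride decision being discussed (the real
mshv_chunk_stride() may differ; checking the guest PFN alignment as well is
essentially what this patch thread is about): use a 2M stride, and hence
HV_MAP_GPA_LARGE_PAGE, only when the folio is at least 2M and both the system
PFN and the target guest PFN sit on 2M boundaries:

#include <linux/mm.h>

/* Sketch: number of 4K pages to map in one chunk starting at this page. */
static unsigned long chunk_stride(struct page *page, u64 guest_pfn)
{
	unsigned long pfn = page_to_pfn(page);

	if (folio_order(page_folio(page)) >= 9 &&	/* host gave us a 2M (or larger) folio */
	    !(pfn & 511) &&				/* system PFN on a 2M boundary */
	    !(guest_pfn & 511))				/* guest PFN on a 2M boundary too */
		return 512;	/* request HV_MAP_GPA_LARGE_PAGE for this chunk */

	return 1;		/* otherwise map this page with a 4K mapping */
}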
The core problem is that the same 2M of physical memory wants to be
mapped by the host as a 2M page and by the hypervisor as a 2M page.
That can't be done unless the host alignment (in the VMM virtual address
space) and the guest physical address (i.e., the target guest PFN) alignment
match and are both on 2M boundaries.
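That suggests a check at region-creation time along these lines (a sketch of
the proposal, not existing code; the helper name is mine, and only the
simpler "smaller than 2M" exemption from my earlier message is shown):

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sizes.h>

/*
 * Sketch of the proposed ioctl-time check: reject a region whose userspace
 * address and guest physical address have different offsets modulo 2M, since
 * such a region can never be mapped with 2M pages by both the host and the
 * hypervisor.  Regions smaller than 2M can't use a 2M page anyway.
 */
static int check_region_2m_alignment(u64 userspace_addr, u64 guest_pfn, u64 size)
{
	u64 gpa = guest_pfn << PAGE_SHIFT;

	if (size < SZ_2M)
		return 0;

	if ((userspace_addr & (SZ_2M - 1)) != (gpa & (SZ_2M - 1)))
		return -EINVAL;

	return 0;
}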
Movable regions behave a bit differently because the memory for the
region is not allocated on the host "up front" when the region is created.
The memory is faulted in as the guest runs, and the vagaries of the current
MSHV code in Linux are such that 2M pages are never created on the host
if the alignments don't match. HV_MAP_GPA_LARGE_PAGE is never passed
to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K
mappings, which works even with the misalignment.
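A rough sketch of the geometry in that fault path (the actual
mshv_region_handle_gfn_fault()/mshv_region_range_fault() code may differ;
the names here are mine): round the faulting GFN down to a 2M boundary and
derive the matching userspace address, which ends up misaligned whenever the
region's offsets differ modulo 2M:

#include <linux/mm.h>

/*
 * Sketch: build the 2M-aligned guest PFN chunk around a faulting GFN and the
 * corresponding userspace address.  If the region's userspace address and
 * guest PFN have different offsets modulo 2M, the resulting chunk address is
 * not 2M aligned, so the host never instantiates a 2M page for it.
 */
static void gfn_fault_chunk(u64 fault_gfn, u64 region_start_gfn,
			    unsigned long region_start_uaddr,
			    u64 *chunk_gfn, unsigned long *chunk_uaddr)
{
	*chunk_gfn = fault_gfn & ~(u64)511;		/* round down to a 2M GFN boundary */
	if (*chunk_gfn < region_start_gfn)		/* keep the chunk inside the region */
		*chunk_gfn = region_start_gfn;
	*chunk_uaddr = region_start_uaddr +
		       ((*chunk_gfn - region_start_gfn) << PAGE_SHIFT);
}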
>
> This adjustment happens at runtime. Could this be the missing detail here?
Adjustments at runtime are a different topic from the issue I'm raising,
though eventually there's some relationship. My issue occurs in the
creation of a new region, and the setting up of the initial hypervisor
mapping. I haven't thought through the details of adjustments at runtime.
My usual caveats apply -- this is all "thought experiment". If I had the
means to do some runtime testing to confirm, I would. It's possible the
hypervisor is playing some trick I haven't envisioned, but I'm skeptical of
that given the basics of how physical processors work with page tables.
Michael