Message-ID: <aUrC94YvscoqBzh3@skinsburskii.localdomain>
Date: Tue, 23 Dec 2025 08:27:35 -0800
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>
Cc: "kys@...rosoft.com" <kys@...rosoft.com>,
"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
"wei.liu@...nel.org" <wei.liu@...nel.org>,
"decui@...rosoft.com" <decui@...rosoft.com>,
"longli@...rosoft.com" <longli@...rosoft.com>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mshv: Align huge page stride with guest mapping
On Mon, Dec 22, 2025 at 06:25:02PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Friday, December 19, 2025 2:54 PM
> >
> > On Thu, Dec 18, 2025 at 07:41:24PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Tuesday,
> > > December 16, 2025 4:41 PM
> > > >
> > > > Ensure that a stride larger than 1 (huge page) is only used when both
> > > > the guest frame number (gfn) and the operation size (page_count) are
> > > > aligned to the huge page size (PTRS_PER_PMD). This matches the
> > > > hypervisor requirement that map/unmap operations for huge pages must be
> > > > guest-aligned and cover a full huge page.
> > > >
> > > > Add mshv_chunk_stride() to encapsulate this alignment and page-order
> > > > validation, and plumb a huge_page flag into the region chunk handlers.
> > > > This prevents issuing large-page map/unmap/share operations that the
> > > > hypervisor would reject due to misaligned guest mappings.
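> > > >
> > > > For illustration, the check is conceptually along these lines (a
> > > > simplified sketch; the exact signature and return convention in the
> > > > patch may differ):
> > > >
> > > > 	/* Number of 4K pages to advance per mapping entry:
> > > > 	 * PTRS_PER_PMD (512 == 1 << 9, i.e. one 2M page) when a
> > > > 	 * huge-page stride is safe, otherwise 1.
> > > > 	 */
> > > > 	static u64 mshv_chunk_stride(u64 gfn, u64 page_count, bool huge_page)
> > > > 	{
> > > > 		if (huge_page &&
> > > > 		    IS_ALIGNED(gfn, PTRS_PER_PMD) &&
> > > > 		    IS_ALIGNED(page_count, PTRS_PER_PMD))
> > > > 			return PTRS_PER_PMD;
> > > > 		return 1;
> > > > 	}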
> > >
> > > This code looks good to me on the surface. But I can only make an educated
> > > guess as to the hypervisor behavior in certain situations, and if my guess is
> > > correct there's still a flaw in one case.
> > >
> > > Consider the madvise() DONTNEED experiment that I previously called out. [1]
> > > I surmise that the intent of this patch is to make that case work correctly.
> > > When the .invalidate callback is made for the 32 Kbyte range embedded in
> > > a previously mapped 2 Meg page, this new code detects that case. It calls the
> > > hypervisor to remap the 32 Kbyte range for no access, and clears the 8
> > > corresponding entries in the struct page array attached to the mshv region. The
> > > call to the hypervisor is made *without* the HV_MAP_GPA_LARGE_PAGE flag.
> > > Since the mapping was originally done *with* the HV_MAP_GPA_LARGE_PAGE
> > > flag, my guess is that the hypervisor is smart enough to handle this case by
> > > splitting the 2 Meg mapping it created, setting the 32 Kbyte range to no access,
> > > and returning "success". If my guess is correct, there's no problem here.
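> > >
> > > (For reference, the experiment boils down to roughly the following,
> > > where "base" is a 2M-aligned, THP-backed userspace address already
> > > mapped into the guest; names and values are illustrative:)
> > >
> > > 	#include <sys/mman.h>
> > >
> > > 	/* discard a 32K (8-page) range inside the 2M huge page;
> > > 	 * this triggers the .invalidate callback for just that range */
> > > 	madvise(base + 24 * 4096, 8 * 4096, MADV_DONTNEED);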
> > >
> > > But then there's a second .invalidate callback for the entire 2 Meg page. Here's
> > > the call stack:
> > >
> > > [ 194.259337] dump_stack+0x14/0x20
> > > [ 194.259339] mhktest_invalidate+0x2a/0x40 [my dummy invalidate callback]
> > > [ 194.259342] __mmu_notifier_invalidate_range_start+0x1f4/0x250
> > > [ 194.259347] __split_huge_pmd+0x14f/0x170
> > > [ 194.259349] unmap_page_range+0x104d/0x1a00
> > > [ 194.259358] unmap_single_vma+0x7d/0xc0
> > > [ 194.259360] zap_page_range_single_batched+0xe0/0x1c0
> > > [ 194.259363] madvise_vma_behavior+0xb01/0xc00
> > > [ 194.259366] madvise_do_behavior.part.0+0x3cd/0x4a0
> > > [ 194.259375] do_madvise+0xc7/0x170
> > > [ 194.259380] __x64_sys_madvise+0x2f/0x40
> > > [ 194.259382] x64_sys_call+0x1d77/0x21b0
> > > [ 194.259385] do_syscall_64+0x56/0x640
> > > [ 194.259388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > In __split_huge_pmd(), the .invalidate callback is made *before* the 2 Meg
> > > page is actually split by the root partition. So mshv_chunk_stride() returns "9"
> > > for the stride, and the hypervisor is called with HV_MAP_GPA_LARGE_PAGE
> > > set. My guess is that the hypervisor returns an error because it has already
> > > split the mapping. The whole point of this patch set is to avoid passing
> > > HV_MAP_GPA_LARGE_PAGE to the hypervisor when the hypervisor mapping
> > > is not a large page mapping, but this looks like a case where it still happens.
> > >
> > > My concern is solely from looking at the code and thinking about the problem,
> > > as I don't have an environment where I can test root partition interactions
> > > with the hypervisor. So maybe I'm missing something. Lemme know what you
> > > think .....
> > >
> >
> > Yeah, I see your point: according to this stack, once part of the page
> > is invalidated, the folio order stays the same until the same callback
> > is invoked again - this time for the whole huge page. Thus the stride
> > is still reported as the huge page size, even though part of the page
> > has already been unmapped.
> >
> > This indeed looks like a flaw in the current approach, but it actually
> > isn't. The reason is that upon the invalidation callback the driver
> > simply remaps the whole huge page with no access (in this case the PFNs
> > provided to the hypervisor are zero), and that's fine: the hypervisor
> > simply drops all the pages from the previous mapping and marks the page
> > as inaccessible. The only check the hypervisor makes in this case is
> > that both the GFN and the mapping size are huge-page aligned (which
> > they are here).
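> >
> > Conceptually, the invalidation path ends up doing something like the
> > following (a simplified sketch; the helper name and signature are
> > illustrative, not the literal driver code):
> >
> > 	u64 huge_count = page_count / PTRS_PER_PMD;
> >
> > 	/* All PFNs are zero and no access bits are set, so the
> > 	 * hypervisor drops the previous mapping and marks the whole
> > 	 * range inaccessible.  Its only check here is that the GFN
> > 	 * and the size are huge-page aligned.
> > 	 */
> > 	memset(pfn_array, 0, huge_count * sizeof(*pfn_array));
> > 	status = hv_map_gpa_pages(partition_id, gfn, huge_count,
> > 				  HV_MAP_GPA_LARGE_PAGE, pfn_array);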
> >
> > I hope this clarifies the situation. Please let me know if you have any
> > other questions.
>
> Thanks. Yes, this clarifies. My guess about the hypervisor behavior was wrong.
> Based on what you've said about what the hypervisor does, and further studying
> MSHV code, here's my recap of the HV_MAP_GPA_LARGE_PAGE flag:
>
> 1. The hypervisor uses the flag to determine the granularity (4K or 2M) of the
> mapping HVCALL_MAP_GPA_PAGES or HVCALL_UNMAP_GPA_PAGES will
> create/remove. As such, the hypercall "repcount" is in this granularity. GFNs,
> such as the target_gpa_base input parameter and GFNs in the pfn_array, are
> always 4K GFNs, but if the flag is set, a GFN is treated as the first 4K GFN in
> a contiguous 2M range. If the flag is set, the target_gpa_base GFN must be
> 2M aligned (see the worked example after point 2 below).
>
> 2. The hypervisor doesn't care whether any existing mapping is 4K or 2M. It
> always removes an existing mapping, including splitting any 2M mappings if
> necessary. Then if the operation is to create/re-create a mapping, it creates
> an appropriate new mapping.
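>
> To make the granularity concrete, here's a worked example (the values
> are illustrative only):
>
> 	/* Map 4M of guest space at 2M-aligned GFN 0x400 (guest PA
> 	 * 0x400000) with HV_MAP_GPA_LARGE_PAGE set:
> 	 *
> 	 *   target_gpa_base = 0x400   a 4K GFN, multiple of 512 (2M aligned)
> 	 *   repcount        = 2       two 2M units, not 1024 4K units
> 	 *   pfn_array[0]    = first 4K system PFN of the 1st 2M backing page
> 	 *   pfn_array[1]    = first 4K system PFN of the 2nd 2M backing page
> 	 *
> 	 * Without the flag, the same 4M operation needs repcount = 1024
> 	 * and 1024 pfn_array entries.
> 	 */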
>
> My error was in thinking that the flag had to match any existing mapping.
> But the behavior you've clarified is certainly better. It handles the vagaries
> of the Linux "mm" subsystem, which in one case in my original experiment
> (madvise) invalidates the small range, then the 2M range, while the other
> case (mprotect) invalidates the 2M range first, then the small range.
>
> Since there's no documentation for these root partition hypercalls, it sure
> would be nice if this info could be captured in code comments for some
> future developer to benefit from. If that's not something you want to
> worry about, I could submit a patch later to add the code comments
> (subject to your review, of course).
>
Please feel free to add whatever comments you see fit. I think you'll
do it better, as you have a clearer understanding of what needs to be
documented.

Thanks,
Stanislav
> Separately, in looking at this, I spotted another potential problem with
> 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> not clear on. To create a new region, the user space VMM issues the
> MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> size, and the guest PFN. The only requirement on these values is that the
> userspace address and size be page aligned. But suppose a 4 Meg region is
> specified where the userspace address and the guest PFN have different
> offsets modulo 2 Meg. The userspace address range gets populated first,
> and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> to create a 2 Meg mapping for the guest, the corresponding system PFN in
> the page array may not be 2 Meg aligned. What does the hypervisor do in
> this case? It can't create a 2 Meg mapping, right? So does it silently fall back
> to creating 4K mappings, or does it return an error? Returning an error would
> seem to be problematic for movable pages because the error wouldn't
> occur until the guest VM is running and takes a range fault on the region.
> Silently falling back to creating 4K mappings has performance implications,
> though I guess it would work. My question is whether the
> MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> error immediately.
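>
> To put concrete (hypothetical) numbers on it:
>
> 	/* A 4M region whose two offsets differ modulo 2M:
> 	 *
> 	 *   userspace va = 0x7f0000100000   1M into a 2M block
> 	 *   guest GFN    = 0x400            guest PA 0x400000, 2M aligned
> 	 *
> 	 * If the VMM range is backed by a 2M THP starting at va
> 	 * 0x7f0000000000, the 512 system PFNs backing guest GFNs
> 	 * 0x400..0x5ff begin 256 pages into that huge page, so the
> 	 * first backing PFN is not 512-aligned and the hypervisor
> 	 * cannot create a 2M mapping for that chunk.
> 	 */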
>
> Michael
>
> > >
> > > [1] https://lore.kernel.org/linux-hyperv/SN6PR02MB4157978DFAA6C2584D0678E1D4A1A@SN6PR02MB4157.namprd02.prod.outlook.com/