Message-ID:
<SN6PR02MB4157978DFAA6C2584D0678E1D4A1A@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Thu, 11 Dec 2025 17:37:26 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
CC: "kys@...rosoft.com" <kys@...rosoft.com>, "haiyangz@...rosoft.com"
<haiyangz@...rosoft.com>, "wei.liu@...nel.org" <wei.liu@...nel.org>,
"decui@...rosoft.com" <decui@...rosoft.com>, "linux-hyperv@...r.kernel.org"
<linux-hyperv@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v7 4/7] Drivers: hv: Fix huge page handling in memory
region traversal

From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Thursday, December 4, 2025 1:09 PM
>
> On Thu, Dec 04, 2025 at 04:03:26PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Tuesday, November 25, 2025 6:09 PM
> > >
[snip]
> > > +static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > > + u32 flags,
> > > + u64 page_offset, u64 page_count,
> > > + int (*handler)(struct mshv_mem_region *region,
> > > + u32 flags,
> > > + u64 page_offset,
> > > + u64 page_count))
> > > +{
> > > + u64 count, stride;
> > > + unsigned int page_order;
> > > + struct page *page;
> > > + int ret;
> > > +
> > > + page = region->pages[page_offset];
> > > + if (!page)
> > > + return -EINVAL;
> > > +
> > > + page_order = folio_order(page_folio(page));
> > > + /* 1G huge pages aren't supported by the hypercalls */
> > > + if (page_order == PUD_ORDER)
> > > + return -EINVAL;
> >
> > In the general case, folio_order() could return a variety of values ranging from
> > 0 up to at least PUD_ORDER. For example, "2" would be a valid value in file
> > system code that uses folios to do I/O in 16K blocks instead of just 4K blocks.
> > Since this function is trying to find contiguous chunks of either single pages
> > or 2M huge pages, I think you are expecting only three possible values: 0,
> > PMD_ORDER, or PUD_ORDER. But do you know that "2" (for example)
> > would never be returned? The memory involved here is populated using
> > pin_user_pages_fast() for the pinned case, or using hmm_range_fault() for
> > the movable case. I don't know mm behavior well enough to know if those
> > functions could ever populate with a folio with an order other than 0 or
> > PMD_ORDER. If such a folio could ever be used, then the way you check
> > for a page size change won't be valid. For purposes of informing Hyper-V
> > about 2 Meg pages, folio orders 0 through 8 are all equivalent, with folio
> > order 9 (PMD_ORDER) being the marker for the start of a 2 Meg large page.
> >
> > Somebody who knows mm behavior better than I do should comment. Or
> > maybe you could just be more defensive and handle the case of folio orders
> > not equal to 0 or PMD_ORDER.
> >
>
> Thanks for the comment.
> This is addressed in v9 by explicitly checking for 0 and HPAGE_PMD_ORDER.
>
> > > +
> > > + stride = 1 << page_order;
> > > +
> > > + /* Start at stride since the first page is validated */
> > > + for (count = stride; count < page_count; count += stride) {
> >
> > This striding doesn't work properly in the general case. Suppose the
> > page_offset value puts the start of the chunk in the middle of a 2 Meg
> > page, and that 2 Meg page is then followed by a bunch of single pages.
> > (Presumably the mmu notifier "invalidate" callback could do this.)
> > The use of the full stride here jumps over the remaining portion of the
> > 2 Meg page plus some number of the single pages, which isn't what you
> > want. For the striding to work, it must figure out how much remains in the
> > initial large page, and then once the striding is aligned to the large page
> > boundaries, the full stride length works.
> >
> > Also, what do the hypercalls in the handler functions do if a chunk starts
> > in the middle of a 2 Meg page? It looks like the handler functions will pass
> > the *_LARGE_PAGE flag to the hypercall, but then the hv_call_* function
> > will fail if the page_count isn't 2 Meg aligned.
> >
>
> The situation you described is not possible, because the invalidation
> callback simply can't invalidate a part of a huge page, even in the THP
> case (leaving aside the hugetlb case), without splitting it beforehand,
> and splitting a huge page requires invalidation of the whole huge page
> first.

I've been playing around with mmu notifiers and 2 Meg pages. At least in my
experiment, there's a case where the .invalidate callback is invoked on a
range *before* the 2 Meg page is split. The kernel code that does this is
in zap_page_range_single_batched(). Early on this function calls
mmu_notifier_invalidate_range_start(), which invokes the .invalidate
callback on the initial range. Later on, unmap_single_vma() is called, which
does the split and eventually makes a second .invalidate callback for the
entire 2 Meg page.
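
For reference, a minimal notifier is enough to observe this ordering. The
sketch below is roughly what's needed (names are made up for illustration,
and the ioctl plumbing that registers it against the target mm isn't shown):

#include <linux/mmu_notifier.h>
#include <linux/printk.h>

static int test_invalidate_range_start(struct mmu_notifier *mn,
                                       const struct mmu_notifier_range *range)
{
        /* Log every invalidation so the two-callback sequence is visible */
        pr_info("mmu invalidate: start=%lx end=%lx len=%lu\n",
                range->start, range->end, range->end - range->start);
        return 0;
}

static const struct mmu_notifier_ops test_mn_ops = {
        .invalidate_range_start = test_invalidate_range_start,
};

static struct mmu_notifier test_mn = {
        .ops = &test_mn_ops,
};

/* Called from the hacked-in ioctl, with the target mm (e.g. current->mm) */
static int test_register_notifier(struct mm_struct *mm)
{
        return mmu_notifier_register(&test_mn, mm);
}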

Details: My experiment is a user space program that does the following
(a rough sketch of the user-space side appears after the list):
1. Allocates 16 Megs of memory on a 16 Meg boundary using
posix_memalign(). So this is private anonymous memory. Transparent
huge pages are enabled.
2. Writes to a byte in each 4K page so they are all populated.
/proc/meminfo shows eight 2 Meg pages have been allocated.
3. Creates an mmu notifier for the allocated 16 Megs, using an ioctl
hacked into the kernel for experimentation purposes.
4. Uses madvise() with the DONTNEED option to free 32 Kbytes on a 4K
page boundary somewhere in the 16 Meg allocation. This results in an mmu
notifier invalidate callback for that 32 Kbytes. Then there's a second invalidate
callback covering the entire 2 Meg page that contains the 32 Kbyte range.
Kernel stack traces for the two invalidate callbacks show them originating
in zap_page_range_single_batched().
5. Sleeps for 60 seconds. During that time, khugepaged wakes up and does
hpage_collapse_scan_pmd() -> collapse_huge_page(), which generates a third
.invalidate callback for the 2 Meg page. I haven't investigated what this is
all about.
6. Interestingly, if Step 4 above does a slightly different operation using
mprotect() with PROT_READ instead of madvise(), the 2 Meg page is split first.
The .invalidate callback for the full 2 Meg happens before the .invalidate
callback for the specified range.
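
For completeness, here's the promised sketch of the user-space side of steps
1 through 6 (the ioctl from step 3 is my own hack and isn't shown; the exact
offsets in step 4 are arbitrary, as long as the 32 Kbytes lands inside a
single 2 Meg page):

#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define ALLOC_SIZE   (16UL * 1024 * 1024)   /* 16 Megs, 16 Meg aligned */
#define PAGE_SIZE_4K 4096UL
#define HPAGE_SIZE   (2UL * 1024 * 1024)

int main(void)
{
        unsigned char *buf;
        unsigned long i;

        /* Step 1: 16 Megs of private anonymous memory on a 16 Meg boundary */
        if (posix_memalign((void **)&buf, ALLOC_SIZE, ALLOC_SIZE))
                return 1;

        /* Step 2: touch each 4K page so THP populates 2 Meg pages */
        for (i = 0; i < ALLOC_SIZE; i += PAGE_SIZE_4K)
                buf[i] = 1;

        /* Step 3 (not shown): ioctl to register the mmu notifier */

        /* Step 4: free 32 Kbytes on a 4K boundary inside one 2 Meg page */
        madvise(buf + 3 * HPAGE_SIZE + 16 * PAGE_SIZE_4K,
                8 * PAGE_SIZE_4K, MADV_DONTNEED);

        /* Step 6 variant: mprotect() instead of madvise() splits first
         * mprotect(buf + 3 * HPAGE_SIZE + 16 * PAGE_SIZE_4K,
         *          8 * PAGE_SIZE_4K, PROT_READ);
         */

        /* Step 5: give khugepaged time to run */
        sleep(60);
        return 0;
}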

The root partition probably isn't doing madvise() with DONTNEED for memory
allocated for guests. But regardless of what user space does or doesn't do, MSHV's
invalidate callback path should be made safe for this case. Maybe that's just
detecting the case and returning an error (and maybe a WARN_ON), if user space
doesn't actually need partial invalidation of a 2 Meg page to work.
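
To be concrete about "detecting it": something roughly like the following at
the top of mshv_region_process_chunk() would do. This is an untested sketch,
reusing this patch's local variables plus a struct folio pointer; whether to
WARN, skip the chunk, or fail outright is a separate question.

        struct folio *folio;

        page = region->pages[page_offset];
        if (!page)
                return -EINVAL;

        folio = page_folio(page);
        page_order = folio_order(folio);

        /* Only 4K and 2M folios are expected from pinning/faulting */
        if (page_order != 0 && page_order != PMD_ORDER)
                return -EINVAL;

        /*
         * A chunk that starts in the middle of a 2 Meg folio, or covers
         * only part of one (e.g. a partial-range invalidate), can't be
         * described to the hypercall with a *_LARGE_PAGE flag.
         */
        if (page_order == PMD_ORDER &&
            (folio_page_idx(folio, page) != 0 ||
             page_count < folio_nr_pages(folio))) {
                WARN_ON_ONCE(1);
                return -EINVAL;
        }

That would at least turn a chunk that partially covers a 2 Meg page into a
visible error instead of a hypercall made with the wrong LARGE_PAGE flag.
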
Michael
>
> > > + page = region->pages[page_offset + count];
> > > +
> > > + /* Break if current page is not present */
> > > + if (!page)
> > > + break;
> > > +
> > > + /* Break if page size changes */
> > > + if (page_order != folio_order(page_folio(page)))
> > > + break;
> > > + }
> > > +
> > > + ret = handler(region, flags, page_offset, count);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + return count;
> > > +}
> > > +
> > > +/**
> > > + * mshv_region_process_range - Processes a range of memory pages in a
> > > + * region.
> > > + * @region : Pointer to the memory region structure.
> > > + * @flags : Flags to pass to the handler.
> > > + * @page_offset: Offset into the region's pages array to start processing.
> > > + * @page_count : Number of pages to process.
> > > + * @handler : Callback function to handle each chunk of contiguous
> > > + * pages.
> > > + *
> > > + * Iterates over the specified range of pages in @region, skipping
> > > + * non-present pages. For each contiguous chunk of present pages, invokes
> > > + * @handler via mshv_region_process_chunk.
> > > + *
> > > + * Note: The @handler callback must be able to handle both normal and huge
> > > + * pages.
> > > + *
> > > + * Returns 0 on success, or a negative error code on failure.
> > > + */
> > > +static int mshv_region_process_range(struct mshv_mem_region *region,
> > > + u32 flags,
> > > + u64 page_offset, u64 page_count,
> > > + int (*handler)(struct mshv_mem_region *region,
> > > + u32 flags,
> > > + u64 page_offset,
> > > + u64 page_count))
> > > +{
> > > + long ret;
> > > +
> > > + if (page_offset + page_count > region->nr_pages)
> > > + return -EINVAL;
> > > +
> > > + while (page_count) {
> > > + /* Skip non-present pages */
> > > + if (!region->pages[page_offset]) {
> > > + page_offset++;
> > > + page_count--;
> > > + continue;
> > > + }
> > > +
> > > + ret = mshv_region_process_chunk(region, flags,
> > > + page_offset,
> > > + page_count,
> > > + handler);
> > > + if (ret < 0)
> > > + return ret;
> > > +
> > > + page_offset += ret;
> > > + page_count -= ret;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> > > u64 uaddr, u32 flags,
> > > bool is_mmio)
> > > @@ -33,55 +151,80 @@ struct mshv_mem_region *mshv_region_create(u64
> guest_pfn, u64 nr_pages,
> > > if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
> > > region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
> > >
> > > - /* Note: large_pages flag populated when we pin the pages */
> > > if (!is_mmio)
> > > region->flags.range_pinned = true;
> > >
> > > return region;
> > > }
> > >
> > > +static int mshv_region_chunk_share(struct mshv_mem_region *region,
> > > + u32 flags,
> > > + u64 page_offset, u64 page_count)
> > > +{
> > > + if (PageTransCompound(region->pages[page_offset]))
> > > + flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> >
> > mshv_region_process_chunk() uses folio_order() to detect single pages vs. 2 Meg
> > large pages. Here you are using PageTransCompound(). Any reason for the
> > difference? This may be perfectly OK, but my knowledge of mm is too limited to
> > know for sure. Looking at the implementations of folio_order() and
> > PageTransCompound(), they seem to be looking at different fields in the struct page,
> > and I don't know if the different fields are always in sync. Another case for someone
> > with mm expertise to review carefully ....
> >
>
> Indeed, folio_order() could be used here as well, and PageTransCompound()
> could be used in the chunk processing function (but then the size of the
> page would still need to be checked).
> On the other hand, there is a subtle difference between the chunk
> processing function and the callbacks it calls: the latter don't
> validate the input, so the chunk processing function should.
>
> Thanks,
> Stanislav
>
> > Michael
> >
> > > +
> > > + return hv_call_modify_spa_host_access(region->partition->pt_id,
> > > + region->pages + page_offset,
> > > + page_count,
> > > + HV_MAP_GPA_READABLE |
> > > + HV_MAP_GPA_WRITABLE,
> > > + flags, true);
> > > +}
> > > +
> > > int mshv_region_share(struct mshv_mem_region *region)
> > > {
> > > u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
> > >
> > > - if (region->flags.large_pages)
> > > + return mshv_region_process_range(region, flags,
> > > + 0, region->nr_pages,
> > > + mshv_region_chunk_share);
> > > +}
> > > +
> > > +static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> > > + u32 flags,
> > > + u64 page_offset, u64 page_count)
> > > +{
> > > + if (PageTransCompound(region->pages[page_offset]))
> > > flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> > >
> > > return hv_call_modify_spa_host_access(region->partition->pt_id,
> > > - region->pages, region->nr_pages,
> > > - HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE,
> > > - flags, true);
> > > + region->pages + page_offset,
> > > + page_count, 0,
> > > + flags, false);
> > > }
> > >
> > > int mshv_region_unshare(struct mshv_mem_region *region)
> > > {
> > > u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
> > >
> > > - if (region->flags.large_pages)
> > > - flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> > > -
> > > - return hv_call_modify_spa_host_access(region->partition->pt_id,
> > > - region->pages, region->nr_pages,
> > > - 0,
> > > - flags, false);
> > > + return mshv_region_process_range(region, flags,
> > > + 0, region->nr_pages,
> > > + mshv_region_chunk_unshare);
> > > }
> > >
> > > -static int mshv_region_remap_pages(struct mshv_mem_region *region,
> > > - u32 map_flags,
> > > +static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> > > + u32 flags,
> > > u64 page_offset, u64 page_count)
> > > {
> > > - if (page_offset + page_count > region->nr_pages)
> > > - return -EINVAL;
> > > -
> > > - if (region->flags.large_pages)
> > > - map_flags |= HV_MAP_GPA_LARGE_PAGE;
> > > + if (PageTransCompound(region->pages[page_offset]))
> > > + flags |= HV_MAP_GPA_LARGE_PAGE;
> > >
> > > return hv_call_map_gpa_pages(region->partition->pt_id,
> > > region->start_gfn + page_offset,
> > > - page_count, map_flags,
> > > + page_count, flags,
> > > region->pages + page_offset);
> > > }
> > >
> > > +static int mshv_region_remap_pages(struct mshv_mem_region *region,
> > > + u32 map_flags,
> > > + u64 page_offset, u64 page_count)
> > > +{
> > > + return mshv_region_process_range(region, map_flags,
> > > + page_offset, page_count,
> > > + mshv_region_chunk_remap);
> > > +}
> > > +
> > > int mshv_region_map(struct mshv_mem_region *region)
> > > {
> > > u32 map_flags = region->hv_map_flags;
> > > @@ -134,9 +277,6 @@ int mshv_region_pin(struct mshv_mem_region *region)
> > > goto release_pages;
> > > }
> > >
> > > - if (PageHuge(region->pages[0]))
> > > - region->flags.large_pages = true;
> > > -
> > > return 0;
> > >
> > > release_pages:
> > > @@ -144,10 +284,28 @@ int mshv_region_pin(struct mshv_mem_region *region)
> > > return ret;
> > > }
> > >
> > > +static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > > + u32 flags,
> > > + u64 page_offset, u64 page_count)
> > > +{
> > > + if (PageTransCompound(region->pages[page_offset]))
> > > + flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > > +
> > > + return hv_call_unmap_gpa_pages(region->partition->pt_id,
> > > + region->start_gfn + page_offset,
> > > + page_count, 0);
> > > +}
> > > +
> > > +static int mshv_region_unmap(struct mshv_mem_region *region)
> > > +{
> > > + return mshv_region_process_range(region, 0,
> > > + 0, region->nr_pages,
> > > + mshv_region_chunk_unmap);
> > > +}
> > > +
> > > void mshv_region_destroy(struct mshv_mem_region *region)
> > > {
> > > struct mshv_partition *partition = region->partition;
> > > - u32 unmap_flags = 0;
> > > int ret;
> > >
> > > hlist_del(®ion->hnode);
> > > @@ -162,12 +320,7 @@ void mshv_region_destroy(struct mshv_mem_region
> *region)
> > > }
> > > }
> > >
> > > - if (region->flags.large_pages)
> > > - unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > > -
> > > - /* ignore unmap failures and continue as process may be exiting */
> > > - hv_call_unmap_gpa_pages(partition->pt_id, region->start_gfn,
> > > - region->nr_pages, unmap_flags);
> > > + mshv_region_unmap(region);
> > >
> > > mshv_region_invalidate(region);
> > >
> > > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > > index 0366f416c2f0..ff3374f13691 100644
> > > --- a/drivers/hv/mshv_root.h
> > > +++ b/drivers/hv/mshv_root.h
> > > @@ -77,9 +77,8 @@ struct mshv_mem_region {
> > > u64 start_uaddr;
> > > u32 hv_map_flags;
> > > struct {
> > > - u64 large_pages: 1; /* 2MiB */
> > > u64 range_pinned: 1;
> > > - u64 reserved: 62;
> > > + u64 reserved: 63;
> > > } flags;
> > > struct mshv_partition *partition;
> > > struct page *pages[];
> > >
> > >
> >