Message-ID:
<SN6PR02MB4157D69A4C08B0A4FE01F9FED4A8A@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Thu, 18 Dec 2025 19:41:24 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>,
"kys@...rosoft.com" <kys@...rosoft.com>, "haiyangz@...rosoft.com"
<haiyangz@...rosoft.com>, "wei.liu@...nel.org" <wei.liu@...nel.org>,
"decui@...rosoft.com" <decui@...rosoft.com>, "longli@...rosoft.com"
<longli@...rosoft.com>
CC: "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] mshv: Align huge page stride with guest mapping

From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Tuesday, December 16, 2025 4:41 PM
>
> Ensure that a stride larger than 1 (huge page) is only used when both
> the guest frame number (gfn) and the operation size (page_count) are
> aligned to the huge page size (PTRS_PER_PMD). This matches the
> hypervisor requirement that map/unmap operations for huge pages must be
> guest-aligned and cover a full huge page.
>
> Add mshv_chunk_stride() to encapsulate this alignment and page-order
> validation, and plumb a huge_page flag into the region chunk handlers.
> This prevents issuing large-page map/unmap/share operations that the
> hypervisor would reject due to misaligned guest mappings.

This code looks good to me on the surface. But I can only make an educated
guess as to the hypervisor behavior in certain situations, and if my guess is
correct, there's still a flaw in one case.

Consider the madvise() DONTNEED experiment that I previously called out. [1]
I surmise that the intent of this patch is to make that case work correctly.

When the .invalidate callback is made for the 32 Kbyte range embedded in
a previously mapped 2 Meg page, this new code detects that case. It calls the
hypervisor to remap the 32 Kbyte range for no access, and clears the 8
corresponding entries in the struct page array attached to the mshv region. The
call to the hypervisor is made *without* the HV_MAP_GPA_LARGE_PAGE flag.
Since the mapping was originally done *with* the HV_MAP_GPA_LARGE_PAGE
flag, my guess is that the hypervisor is smart enough to handle this case by
splitting the 2 Meg mapping it created, setting the 32 Kbyte range to no access,
and returning "success". If my guess is correct, there's no problem here.
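
For context, here's a minimal sketch of the kind of experiment I mean. It's
not my exact test program, it leaves out registering the buffer as an mshv
guest memory region, and it assumes THP is enabled and the allocation really
does get backed by a 2 Meg page:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SZ_2M  (2UL * 1024 * 1024)
    #define SZ_32K (32UL * 1024)

    int main(void)
    {
            /* 2 Meg aligned anonymous buffer, hinted for THP */
            char *buf = aligned_alloc(SZ_2M, SZ_2M);

            madvise(buf, SZ_2M, MADV_HUGEPAGE);
            memset(buf, 0xa5, SZ_2M);  /* fault in, ideally as one 2 Meg page */

            /*
             * Discard a 32 Kbyte piece of the 2 Meg page. This single call
             * produces both .invalidate callbacks: first for the 32 Kbyte
             * range being zapped, then for the whole 2 Meg range when the
             * kernel goes to split the huge PMD.
             */
            madvise(buf, SZ_32K, MADV_DONTNEED);

            return 0;
    }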

But then there's a second .invalidate callback for the entire 2 Meg page. Here's
the call stack:

[ 194.259337] dump_stack+0x14/0x20
[ 194.259339] mhktest_invalidate+0x2a/0x40 [my dummy invalidate callback]
[ 194.259342] __mmu_notifier_invalidate_range_start+0x1f4/0x250
[ 194.259347] __split_huge_pmd+0x14f/0x170
[ 194.259349] unmap_page_range+0x104d/0x1a00
[ 194.259358] unmap_single_vma+0x7d/0xc0
[ 194.259360] zap_page_range_single_batched+0xe0/0x1c0
[ 194.259363] madvise_vma_behavior+0xb01/0xc00
[ 194.259366] madvise_do_behavior.part.0+0x3cd/0x4a0
[ 194.259375] do_madvise+0xc7/0x170
[ 194.259380] __x64_sys_madvise+0x2f/0x40
[ 194.259382] x64_sys_call+0x1d77/0x21b0
[ 194.259385] do_syscall_64+0x56/0x640
[ 194.259388] entry_SYSCALL_64_after_hwframe+0x76/0x7e

In __split_huge_pmd(), the .invalidate callback is made *before* the 2 Meg
page is actually split by the root partition. So mshv_chunk_stride() returns a
huge page stride of 512 (1 << PMD_ORDER), and the hypervisor is called with
HV_MAP_GPA_LARGE_PAGE set. My guess is that the hypervisor returns an error
because it has already split the mapping. The whole point of this patch set is
to avoid passing HV_MAP_GPA_LARGE_PAGE to the hypervisor when the hypervisor
mapping is not a large page mapping, but this looks like a case where it still
happens.
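
For reference, this is roughly the relevant part of __split_huge_pmd() in
mm/huge_memory.c (abbreviated from memory, not a verbatim copy). The notifier
range covers the full 2 Meg PMD, and invalidate_range_start runs before the
PMD is split:

    void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
                          unsigned long address, bool freeze,
                          struct folio *folio)
    {
            struct mmu_notifier_range range;
            spinlock_t *ptl;

            mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
                                    address & HPAGE_PMD_MASK,
                                    (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
            mmu_notifier_invalidate_range_start(&range); /* .invalidate here */

            ptl = pmd_lock(vma->vm_mm, pmd);
            /* ... checks elided ... */
            __split_huge_pmd_locked(vma, pmd, range.start, freeze);
            spin_unlock(ptl);

            mmu_notifier_invalidate_range_end(&range);
    }

At that point the backing folio is still order 9, so mshv_chunk_stride() has no
way to know that the hypervisor-side mapping may no longer be a 2 Meg mapping.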

My concern is solely from looking at the code and thinking about the problem,
as I don't have an environment where I can test root partition interactions
with the hypervisor. So maybe I'm missing something. Lemme know what you
think .....

Michael

[1] https://lore.kernel.org/linux-hyperv/SN6PR02MB4157978DFAA6C2584D0678E1D4A1A@SN6PR02MB4157.namprd02.prod.outlook.com/
>
> Fixes: abceb4297bf8 ("mshv: Fix huge page handling in memory region traversal")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
> ---
> drivers/hv/mshv_regions.c | 94 ++++++++++++++++++++++++++++++---------------
> 1 file changed, 63 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 30bacba6aec3..29776019bcde 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -19,6 +19,42 @@
>
> #define MSHV_MAP_FAULT_IN_PAGES PTRS_PER_PMD
>
> +/**
> + * mshv_chunk_stride - Compute stride for mapping guest memory
> + * @page : The page to check for huge page backing
> + * @gfn : Guest frame number for the mapping
> + * @page_count: Total number of pages in the mapping
> + *
> + * Determines the appropriate stride (in pages) for mapping guest memory.
> + * Uses huge page stride if the backing page is huge and the guest mapping
> + * is properly aligned; otherwise falls back to single page stride.
> + *
> + * Return: Stride in pages, or -EINVAL if page order is unsupported.
> + */
> +static int mshv_chunk_stride(struct page *page,
> + u64 gfn, u64 page_count)
> +{
> + unsigned int page_order;
> +
> + page_order = folio_order(page_folio(page));
> + /* The hypervisor only supports 4K and 2M page sizes */
> + if (page_order && page_order != PMD_ORDER)
> + return -EINVAL;
> +
> + /*
> + * Default to a single page stride. If page_order is set and both
> + * the guest frame number (gfn) and page_count are huge-page
> + * aligned (PTRS_PER_PMD), use a larger stride so the mapping can
> + * be backed by a huge page in both guest and hypervisor.
> + */
> + if (page_order &&
> + IS_ALIGNED(gfn, PTRS_PER_PMD) &&
> + IS_ALIGNED(page_count, PTRS_PER_PMD))
> + return 1 << page_order;
> +
> + return 1;
> +}
> +
> /**
> * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
> * in a region.
> @@ -45,25 +81,23 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> int (*handler)(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset,
> - u64 page_count))
> + u64 page_count,
> + bool huge_page))
> {
> - u64 count, stride;
> - unsigned int page_order;
> + u64 gfn = region->start_gfn + page_offset;
> + u64 count;
> struct page *page;
> - int ret;
> + int stride, ret;
>
> page = region->pages[page_offset];
> if (!page)
> return -EINVAL;
>
> - page_order = folio_order(page_folio(page));
> - /* The hypervisor only supports 4K and 2M page sizes */
> - if (page_order && page_order != PMD_ORDER)
> - return -EINVAL;
> -
> - stride = 1 << page_order;
> + stride = mshv_chunk_stride(page, gfn, page_count);
> + if (stride < 0)
> + return stride;
>
> - /* Start at stride since the first page is validated */
> + /* Start at stride since the first stride is validated */
> for (count = stride; count < page_count; count += stride) {
> page = region->pages[page_offset + count];
>
> @@ -71,12 +105,13 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> if (!page)
> break;
>
> - /* Break if page size changes */
> - if (page_order != folio_order(page_folio(page)))
> + /* Break if stride size changes */
> + if (stride != mshv_chunk_stride(page, gfn + count,
> + page_count - count))
> break;
> }
>
> - ret = handler(region, flags, page_offset, count);
> + ret = handler(region, flags, page_offset, count, stride > 1);
> if (ret)
> return ret;
>
> @@ -108,7 +143,8 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
> int (*handler)(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset,
> - u64 page_count))
> + u64 page_count,
> + bool huge_page))
> {
> long ret;
>
> @@ -162,11 +198,10 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
>
> static int mshv_region_chunk_share(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count)
> + u64 page_offset, u64 page_count,
> + bool huge_page)
> {
> - struct page *page = region->pages[page_offset];
> -
> - if (PageHuge(page) || PageTransCompound(page))
> + if (huge_page)
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> return hv_call_modify_spa_host_access(region->partition->pt_id,
> @@ -188,11 +223,10 @@ int mshv_region_share(struct mshv_mem_region *region)
>
> static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count)
> + u64 page_offset, u64 page_count,
> + bool huge_page)
> {
> - struct page *page = region->pages[page_offset];
> -
> - if (PageHuge(page) || PageTransCompound(page))
> + if (huge_page)
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> return hv_call_modify_spa_host_access(region->partition->pt_id,
> @@ -212,11 +246,10 @@ int mshv_region_unshare(struct mshv_mem_region *region)
>
> static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count)
> + u64 page_offset, u64 page_count,
> + bool huge_page)
> {
> - struct page *page = region->pages[page_offset];
> -
> - if (PageHuge(page) || PageTransCompound(page))
> + if (huge_page)
> flags |= HV_MAP_GPA_LARGE_PAGE;
>
> return hv_call_map_gpa_pages(region->partition->pt_id,
> @@ -295,11 +328,10 @@ int mshv_region_pin(struct mshv_mem_region *region)
>
> static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count)
> + u64 page_offset, u64 page_count,
> + bool huge_page)
> {
> - struct page *page = region->pages[page_offset];
> -
> - if (PageHuge(page) || PageTransCompound(page))
> + if (huge_page)
> flags |= HV_UNMAP_GPA_LARGE_PAGE;
>
> return hv_call_unmap_gpa_pages(region->partition->pt_id,
>
>