Message-ID:
 <SN6PR02MB415740A80DA4FF9661040147D4D5A@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Fri, 21 Nov 2025 05:45:20 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
CC: "kys@...rosoft.com" <kys@...rosoft.com>, "haiyangz@...rosoft.com"
	<haiyangz@...rosoft.com>, "wei.liu@...nel.org" <wei.liu@...nel.org>,
	"decui@...rosoft.com" <decui@...rosoft.com>, "linux-hyperv@...r.kernel.org"
	<linux-hyperv@...r.kernel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v6 5/5] Drivers: hv: Add support for movable memory
 regions

From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Wednesday, November 19, 2025 4:45 PM
> 
> On Tue, Nov 18, 2025 at 04:29:56PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Monday, November 17, 2025 8:53 AM

[snip]

> > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > index 0b8c391a0342..5f1637cbb6e3 100644
> > > --- a/drivers/hv/Kconfig
> > > +++ b/drivers/hv/Kconfig
> > > @@ -75,6 +75,7 @@ config MSHV_ROOT
> > >  	depends on PAGE_SIZE_4KB
> > >  	select EVENTFD
> > >  	select VIRT_XFER_TO_GUEST_WORK
> > > +	select HMM_MIRROR
> >
> > Couldn't you also do "select MMU_NOTIFIER" to avoid the #ifdef's
> > and stubs for when it isn't selected? There are other Linux kernel
> > drivers that select it. Or is the intent to allow building an image that
> > doesn't support unpinned memory, and the #ifdef's save space?
> >
> 
> That's an interesting question. This driver can function without MMU notifiers
> by pinning all memory, which might be advantageous for certain real-time
> applications.
> However, since most other virtualization solutions use MMU_NOTIFIER, there
> doesn't appear to be a strong reason for this driver to deviate.

I'm not clear on your last sentence. Could you elaborate?
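For concreteness, the change I was suggesting was just one more select
next to the HMM_MIRROR one (illustrative sketch, not tested):

--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ config MSHV_ROOT
 	select VIRT_XFER_TO_GUEST_WORK
 	select HMM_MIRROR
+	select MMU_NOTIFIER

which would let the CONFIG_MMU_NOTIFIER #ifdef's and stubs go away.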

> 
> > >  	default n
> > >  	help
> > >  	  Select this option to enable support for booting and running as root

[snip]

> > > +
> > > +	for (i = 0; i < page_count; i++)
> > > +		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> >
> > The comment with hmm_pfn_to_page() says that the caller is assumed to
> > have tested the pfn for HMM_PFN_VALID, which I don't see done anywhere.
> > Is there a reason it's not necessary to test?
> 
> The reason is the HMM_PFN_REQ_FAULT range flag, which requires all PFNs to be
> faulted and populated, or mshv_region_hmm_fault_and_lock will return -EFAULT.
> Additionally, note that mshv_region_hmm_fault_and_lock returns with
> region->mutex held upon success, ensuring that no page can be moved until the
> lock is released.

OK, that makes sense.
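For anyone else following along, the guarantee comes from
hmm_range_fault() itself. Roughly (a sketch from my reading of the
patch -- the local variable names here are mine, not the patch's):

	/* Sketch: called under mmap_read_lock(), with notifier_seq
	 * obtained from mmu_interval_read_begin(&region->mni).
	 */
	struct hmm_range range = {
		.notifier      = &region->mni,
		.notifier_seq  = seq,
		.start         = start,	/* uaddr of first page to fault */
		.end           = end,
		.hmm_pfns      = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret = hmm_range_fault(&range);

	/* With HMM_PFN_REQ_FAULT set, hmm_range_fault() either faults in
	 * and populates every entry of pfns[] or fails outright (e.g.
	 * -EFAULT), so on success every pfn already satisfies the
	 * HMM_PFN_VALID test.
	 */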

> 
> > > +
> > > +	if (PageHuge(region->pages[page_offset]))
> > > +		region->flags.large_pages = true;
> >
> > See comment below in mshv_region_handle_gfn_fault().
> >
> > > +
> > > +	ret = mshv_region_remap_pages(region, region->hv_map_flags,
> > > +				      page_offset, page_count);
> > > +
> > > +	mutex_unlock(&region->mutex);
> > > +out:
> > > +	kfree(pfns);
> > > +	return ret;
> > > +}
> > > +#else /* CONFIG_MMU_NOTIFIER */
> > > +static int mshv_region_range_fault(struct mshv_mem_region *region,
> > > +				   u64 page_offset, u64 page_count)
> > > +{
> > > +	return -ENODEV;
> > > +}
> > > +#endif /* CONFIG_MMU_NOTIFIER */
> > > +
> > > +static bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> > > +{
> > > +	u64 page_offset, page_count;
> > > +	int ret;
> > > +
> > > +	if (WARN_ON_ONCE(region->flags.range_pinned))
> > > +		return false;
> > > +
> > > +	/* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
> > > +	page_offset = ALIGN_DOWN(gfn - region->start_gfn,
> > > +				 MSHV_MAP_FAULT_IN_PAGES);
> > > +
> > > +	/* Map more pages than requested to reduce the number of faults. */
> > > +	page_count = min(region->nr_pages - page_offset,
> > > +			 MSHV_MAP_FAULT_IN_PAGES);
> >
> > These computations make the range defined by page_offset and page_count
> > start on a 512 page boundary relative to start_gfn, and have a size that is a
> > multiple of 512 pages. But they don't ensure that the range aligns to a large page
> > boundary within gfn space since region->start_gfn may not be a multiple of
> > 512. Then mshv_region_range_fault() tests the first page of the range for
> > being a large page, and if so, sets region->large_pages. This doesn't make
> > sense to me if the range doesn't align to a large page boundary.
> >
> > Does this code need to make sure that the range is aligned to a large
> > page boundary in gfn space? Or am I misunderstanding what the
> > region->large_pages flag means? Given the fixes in this v6 of the
> > patch set, I was thinking that region->large_pages means that every
> > large page aligned area within the region is a large page. If region->
> > start_gfn and region->nr_pages aren't multiples of 512, then there
> > may be an initial range and a final range that aren't large pages,
> > but everything in between is. If that's not a correct understanding,
> > could you clarify the exact meaning of the region->large_pages
> > flag?
> >
> 
> That's a good catch. Initially, the approach to memory deposit involved pinning
> and depositing all pages. The code assumes that if the first page in the region
> is huge, it is sufficient to use the "map huge page" flag in the hypercall.

Right. But even for pinned regions as coded today, is that assumption
correct? Due to memory being fragmented at the time of region creation,
it would be possible that some 2Meg ranges in a region are backed by a large
page, while other 2Meg ranges are not. In that case, a single per-region flag
isn't enough information. Or does the hypercall work OK if the "map huge
page" flag is passed when the range isn't a huge page? I'm not clear on what
the hypercall requires as input.
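Just to make my original alignment concern concrete, with made-up
numbers: suppose region->start_gfn = 0x100, which is not a multiple of
512, and a fault arrives at gfn 0x345. Then

	page_offset = ALIGN_DOWN(0x345 - 0x100, 512)
		    = ALIGN_DOWN(0x245, 0x200)
		    = 0x200

so the faulted range starts at gfn 0x100 + 0x200 = 0x300, which is not
2 Meg aligned in gfn space even though it is 512-page aligned relative
to start_gfn.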

> 
> With this series, the region is sparse by default, reducing the likelihood of
> huge pages in the region. As a result, using this flag seems neither correct
> nor reasonable.

Yep.

> 
> Ideally, whether to use the flag should be determined during each guest memory
> map/unmap operation, rather than relying on the flag set during the initial
> region mapping.
> 
> For now, I will remove the large_pages flag for movable regions in this
> series, as it is the least intrusive change. However, I plan to investigate
> this further and potentially replace the large_pages flag with a runtime
> check in the next series.

OK.  So what is the impact? Losing the perf benefit of mapping guest
memory in the SLAT as a 2 Meg large page vs. a bunch of individual 4K
pages? Anything else?
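FWIW, when you get to the runtime check, I'd guess it ends up looking
roughly like a per-chunk test at map time. A hypothetical sketch (the
map_flags variable and HV_MAP_GPA_LARGE_PAGE flag name are assumptions
on my part):

	/* Hypothetical sketch: decide large-page mapping per 2 Meg
	 * aligned chunk at map time, instead of via the per-region flag.
	 */
	if (IS_ALIGNED(region->start_gfn + page_offset, PTRS_PER_PMD) &&
	    page_count >= PTRS_PER_PMD &&
	    PageHuge(region->pages[page_offset]))
		map_flags |= HV_MAP_GPA_LARGE_PAGE;	/* flag name assumed */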

> 
> > > +
> > > +	ret = mshv_region_range_fault(region, page_offset, page_count);
> > > +
> > > +	WARN_ONCE(ret,
> > > +		  "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
> > > +		  region->partition->pt_id, region->start_uaddr,
> > > +		  region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> > > +		  gfn, page_offset, page_count);
> > > +
> > > +	return !ret;
> > > +}
> > > +
> > > +/**
> > > + * mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
> > > + * @vp: Pointer to the virtual processor structure.
> > > + *
> > > + * This function processes GPA intercepts by identifying the memory region
> > > + * corresponding to the intercepted GPA, aligning the page offset, and
> > > + * mapping the required pages. It ensures that the region is valid and
> > > + * handles faults efficiently by mapping multiple pages at once.
> > > + *
> > > + * Return: true if the intercept was handled successfully, false otherwise.
> > > + */
> > > +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > +{
> > > +	struct mshv_partition *p = vp->vp_partition;
> > > +	struct mshv_mem_region *region;
> > > +	struct hv_x64_memory_intercept_message *msg;
> > > +	u64 gfn;
> > > +
> > > +	msg = (struct hv_x64_memory_intercept_message *)
> > > +		vp->vp_intercept_msg_page->u.payload;
> > > +
> > > +	gfn = HVPFN_DOWN(msg->guest_physical_address);
> > > +
> > > +	region = mshv_partition_region_by_gfn(p, gfn);
> > > +	if (!region)
> > > +		return false;
> >
> > Does it ever happen that the gfn is legitimately not found in any
> > region, perhaps due to a race? I think the vp_mutex is held here,
> > so maybe that protects the region layout for the VP and "not found"
> > should never occur. If so, should there be a WARN_ON here?
> >
> > If "gfn not found" can be legitimate, perhaps a comment to
> > explain the circumstances would be helpful.
> >
> 
> This is possible if the hypervisor returns an invalid GFN.
> But there is also the possibility that this code races with region
> removal by the guest. I'll address it in the next revision.

In either of these cases, what happens next? The MSHV_RUN_VP ioctl
will return to user space with the unhandled HVMSG_GPA_INTERCEPT
message. Is there anything user space can do to enable the VP to make
progress past the fault? Or does user space just have to terminate the
guest VM?

> 
> > > +
> > > +	if (WARN_ON_ONCE(!region->flags.is_ram))
> > > +		return false;
> > > +
> > > +	if (WARN_ON_ONCE(region->flags.range_pinned))
> > > +		return false;
> > > +
> > > +	return mshv_region_handle_gfn_fault(region, gfn);
> > > +}
> > > +
> > > +#else	/* CONFIG_X86_64 */
> > > +
> > > +static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > > +
> > > +#endif	/* CONFIG_X86_64 */
> > > +
> > > +static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > > +{
> > > +	switch (vp->vp_intercept_msg_page->header.message_type) {
> > > +	case HVMSG_GPA_INTERCEPT:
> > > +		return mshv_handle_gpa_intercept(vp);
> > > +	}
> > > +	return false;
> > > +}
> > > +
> > >  static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
> > >  {
> > >  	long rc;
> > >
> > > -	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> > > -		rc = mshv_run_vp_with_root_scheduler(vp);
> > > -	else
> > > -		rc = mshv_run_vp_with_hyp_scheduler(vp);
> > > +	do {
> > > +		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
> > > +			rc = mshv_run_vp_with_root_scheduler(vp);
> > > +		else
> > > +			rc = mshv_run_vp_with_hyp_scheduler(vp);
> > > +	} while (rc == 0 && mshv_vp_handle_intercept(vp));
> > >
> > >  	if (rc)
> > >  		return rc;
> > > @@ -1194,6 +1385,110 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > >  	return NULL;
> > >  }
> > >
> > > +#if defined(CONFIG_MMU_NOTIFIER)
> > > +static void mshv_region_movable_fini(struct mshv_mem_region *region)
> > > +{
> > > +	if (region->flags.range_pinned)
> > > +		return;
> > > +
> > > +	mmu_interval_notifier_remove(&region->mni);
> > > +}
> > > +
> > > +/**
> > > + * mshv_region_interval_invalidate - Invalidate a range of memory region
> > > + * @mni: Pointer to the mmu_interval_notifier structure
> > > + * @range: Pointer to the mmu_notifier_range structure
> > > + * @cur_seq: Current sequence number for the interval notifier
> > > + *
> > > + * This function invalidates a memory region by remapping its pages with
> > > + * no access permissions. It locks the region's mutex to ensure thread safety
> > > + * and updates the sequence number for the interval notifier. If the range
> > > + * is blockable, it uses a blocking lock; otherwise, it attempts a non-blocking
> > > + * lock and returns false if unsuccessful.
> > > + *
> > > + * NOTE: Failure to invalidate a region is a serious error, as the pages will
> > > + * be considered freed while they are still mapped by the hypervisor.
> > > + * Any attempt to access such pages will likely crash the system.
> > > + *
> > > + * Return: true if the region was successfully invalidated, false otherwise.
> > > + */
> > > +static bool
> > > +mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> > > +				const struct mmu_notifier_range *range,
> > > +				unsigned long cur_seq)
> > > +{
> > > +	struct mshv_mem_region *region = container_of(mni,
> > > +						struct mshv_mem_region,
> > > +						mni);
> > > +	u64 page_offset, page_count;
> > > +	unsigned long mstart, mend;
> > > +	int ret = -EPERM;
> > > +
> > > +	if (mmu_notifier_range_blockable(range))
> > > +		mutex_lock(&region->mutex);
> > > +	else if (!mutex_trylock(&region->mutex))
> > > +		goto out_fail;
> > > +
> > > +	mmu_interval_set_seq(mni, cur_seq);
> > > +
> > > +	mstart = max(range->start, region->start_uaddr);
> > > +	mend = min(range->end, region->start_uaddr +
> > > +		   (region->nr_pages << HV_HYP_PAGE_SHIFT));
> >
> > I'm pretty sure region->start_uaddr is always page aligned. But what
> > about range->start and range->end?  The code here and below assumes
> > they are page aligned. It also assumes that range->end is greater than
> > range->start so the computation of page_count doesn't wrap and so
> > page_count is >= 1. I don't know whether checks for these assumptions
> > are appropriate.
> >
> 
> There is a check that the memory region size is non-zero and page aligned
> in the mshv_partition_ioctl_set_memory function, which is the only path
> that creates memory regions. And the region start is specified in PFNs.
> 

Right -- no disagreement that the region start and size are page aligned
and non-zero. But what about the range that is being invalidated?
(i.e., range->start and range->end) The values in that range are coming
from the mm subsystem, and aren't governed by how a region is created.
If that range is a subset of the MSHV region, then
mshv_region_interval_invalidate() will be operating on whatever subset
was provided in the 'range' argument.

mshv_region_interval_invalidate() is ultimately called from
mmu_notifier_invalidate_range_start(), which has about 30 different
callers in the kernel, mostly in the mm subsystem. It wasn't
clear to me what rules, if any, those 30 callers are following when they
set up the range to be invalidated. 
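If checks do turn out to be appropriate, even something as simple as
this at the top of the callback would document the assumptions
(illustrative only):

	WARN_ON_ONCE(range->start >= range->end ||
		     !PAGE_ALIGNED(range->start) ||
		     !PAGE_ALIGNED(range->end));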

Michael
