Message-ID: <aXzqsfT8-h-g9mex@anirudh-surface.localdomain>
Date: Fri, 30 Jan 2026 17:30:25 +0000
From: Anirudh Rayabharam <anirudh@...rudhrb.com>
To: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
Cc: Michael Kelley <mhklinux@...look.com>,
"kys@...rosoft.com" <kys@...rosoft.com>,
"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
"wei.liu@...nel.org" <wei.liu@...nel.org>,
"decui@...rosoft.com" <decui@...rosoft.com>,
"longli@...rosoft.com" <longli@...rosoft.com>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] mshv: Add support for integrated scheduler
On Thu, Jan 29, 2026 at 11:09:46AM -0800, Stanislav Kinsburskii wrote:
> On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > >
> > > From: Andreea Pintilie <anpintil@...rosoft.com>
> > >
> > > Query the hypervisor for integrated scheduler support and use it if
> > > configured.
> > >
> > > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > > root scheduler allows the root partition to schedule guest vCPUs across
> > > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > > scheduling entirely to the hypervisor.
> > >
> > > Direct virtualization introduces a new privileged guest partition type - L1
> > > Virtual Host (L1VH) — which can create child partitions from its own
> > > resources. These child partitions are effectively siblings, scheduled by
> > > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > > CFS, and cpuset controllers can still be used, their effectiveness is
> > > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > > (typically round-robin across all allocated physical CPUs). As a result,
> > > the system may appear to "steal" time from the L1VH and its children.
> > >
> > > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > > This allows an L1VH partition to schedule its own vCPUs and those of its
> > > guests across its "physical" cores, effectively emulating root scheduler
> > > behavior within the L1VH, while retaining core scheduler behavior for the
> > > rest of the system.
> > >
> > > The integrated scheduler is controlled by the root partition and gated by
> > > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > > supports the integrated scheduler. The L1VH partition must then check if it
> > > is enabled by querying the corresponding extended partition property. If
> > > this property is true, the L1VH partition must use the root scheduler
> > > logic; otherwise, it must use the core scheduler.
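
(For illustration only, a minimal sketch of the discovery flow described
above. mshv_detect_integrated_scheduler() is a hypothetical helper; the
field path for the capability bit and the u64 layout of the "enabled"
property are assumptions based on the hvhdk_mini.h hunk further down.
Only hv_call_get_partition_property_ex() and the property code come from
the patch.)

/*
 * Hypothetical helper, not part of the patch: returns 1 if the L1VH
 * should use the root scheduler logic, 0 if it should keep the core
 * scheduler logic, or a negative error code.
 */
static int mshv_detect_integrated_scheduler(void)
{
	u64 enabled = 0;	/* assumed layout of the extended property */
	int ret;

	/* Capability bit not set: hypervisor has no integrated scheduler */
	if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
		return 0;

	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
			HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED,
			0, &enabled, sizeof(enabled));
	if (ret)
		return ret;

	return enabled ? 1 : 0;	/* 1: emulate root scheduler inside the L1VH */
}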
> > >
> > > Signed-off-by: Andreea Pintilie <anpintil@...rosoft.com>
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
> > > ---
> > > drivers/hv/mshv_root_main.c | 79 +++++++++++++++++++++++++++++--------------
> > > include/hyperv/hvhdk_mini.h | 6 +++
> > > 2 files changed, 58 insertions(+), 27 deletions(-)
> > >
>
> <snip>
>
> > > static int __init mshv_root_partition_init(struct device *dev)
> > > {
> > > int err;
> > >
> > > - err = root_scheduler_init(dev);
> > > - if (err)
> > > - return err;
> > > -
> > > err = register_reboot_notifier(&mshv_reboot_nb);
> > > if (err)
> > > - goto root_sched_deinit;
> > > + return err;
> > >
> > > return 0;
> >
> > This code is now:
> >
> > if (err)
> > return err;
> > return 0;
> >
> > which can be simplified to just:
> >
> > return err;
> >
> > Or drop the local variable 'err' and simplify the entire function to:
> >
> > return register_reboot_notifier(&mshv_reboot_nb);
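
(If that suggestion is taken, the whole function collapses to the
following sketch, keeping the now-unused 'dev' parameter as in the
current signature:)

static int __init mshv_root_partition_init(struct device *dev)
{
	return register_reboot_notifier(&mshv_reboot_nb);
}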
> >
> > There's a tangential question here: Why is this reboot notifier
> > needed in the first place? All it does is remove the cpuhp state
> > that allocates/frees the per-cpu root_scheduler_input and
> > root_scheduler_output pages. Removing the state will free
> > the pages, but if Linux is rebooting, why bother?
> >
>
> This was originally done to support kexec.
> Here is the original commit message:
>
> mshv: perform synic cleanup during kexec
>
> Register a reboot notifier that performs synic cleanup when a kexec
> is in progress.
>
> One notable issue this commit fixes is one where after a kexec, virtio
> devices are not functional. Linux root partition receives MMIO doorbell
> events in the ring buffer in the SIRB synic page. The hypervisor maintains
> a head pointer where it writes new events into the ring buffer. The root
> partition maintains a tail pointer to read events from the buffer.
>
> Upon kexec reboot, all root data structures are re-initialized and thus the
> tail pointer gets reset to zero. The hypervisor on the other hand still
> retains the pre-kexec head pointer which could be non-zero. This means that
> when the hypervisor writes new events to the ring buffer, the root
> partition looks at the wrong place and doesn't find any events. So, future
> doorbell events never get delivered. As a result, virtqueue kicks never get
> delivered to the host.
>
> When the SIRB page is disabled the hypervisor resets the head pointer.
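
(For readers following the thread: the general shape of such a
kexec-aware reboot notifier is sketched below. The callback name and the
exact cleanup are illustrative, not the driver's actual code; only
mshv_reboot_nb, mshv_cpuhp_online and the SIRB behaviour quoted above are
taken from the thread.)

static int mshv_reboot_notify(struct notifier_block *nb,
			      unsigned long action, void *data)
{
	/*
	 * On kexec, remove the per-cpu state early so its teardown
	 * callbacks run; per the original commit message this is what
	 * disables the SIRB page and makes the hypervisor reset its
	 * head pointer, giving the post-kexec kernel a clean ring buffer.
	 */
	if (kexec_in_progress)
		cpuhp_remove_state(mshv_cpuhp_online);

	return NOTIFY_DONE;
}

static struct notifier_block mshv_reboot_nb = {
	.notifier_call = mshv_reboot_notify,
};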
>
> > > -root_sched_deinit:
> > > - root_scheduler_deinit();
> > > - return err;
> > > }
> > >
> > > -static void mshv_init_vmm_caps(struct device *dev)
> > > +static int mshv_init_vmm_caps(struct device *dev)
> > > {
> > > - /*
> > > - * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > - * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > - * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > - */
> > > - if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > - HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > - 0, &mshv_root.vmm_caps,
> > > - sizeof(mshv_root.vmm_caps)))
> > > - dev_warn(dev, "Unable to get VMM capabilities\n");
> > > + int ret;
> > > +
> > > + ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > + HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > + 0, &mshv_root.vmm_caps,
> > > + sizeof(mshv_root.vmm_caps));
> > > + if (ret) {
> > > + dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > + return ret;
> > > + }
> >
> > This is a functional change that isn't mentioned in the commit message.
> > Why is it now appropriate to fail instead of treating the VMM capabilities
> > as all disabled? Presumably there are older versions of the hypervisor that
> > don't support the requirements described in the original comment, but
> > perhaps they are no longer relevant?
> >
>
> To fail is now the only option for the L1VH partition. It must discover
> the scheduler type. Without this information, the partition cannot
> operate. The core scheduler logic will not work with an integrated
> scheduler, and vice versa.
I don't think we need to fail here. If we don't find vmm caps, that
means we are on an older hypervisor that supports L1VH but not the
integrated scheduler (yes, such a version exists). In that case, since
the integrated scheduler is not supported by the hypervisor, the core
scheduler logic will work.
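
To make that concrete, a rough sketch of the fallback (not the posted
patch; it just combines the old warn-and-continue behaviour with the new
int return type, reusing the names from the diff above):

static int mshv_init_vmm_caps(struct device *dev)
{
	int ret;

	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
						 HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
						 0, &mshv_root.vmm_caps,
						 sizeof(mshv_root.vmm_caps));
	if (ret) {
		/* Older hypervisor: treat all vmm_caps as disabled */
		dev_warn(dev, "Unable to get VMM capabilities, assuming none\n");
		memset(&mshv_root.vmm_caps, 0, sizeof(mshv_root.vmm_caps));
		return 0;
	}

	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);

	return 0;
}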
Thanks,
Anirudh.
>
> And yes, older hypervisor versions do not support L1VH.
>
> Thanks,
> Stanislav
>
> > >
> > > dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
> > > +
> > > + return 0;
> > > }
> > >
> > > static int __init mshv_parent_partition_init(void)
> > > @@ -2292,6 +2310,10 @@ static int __init mshv_parent_partition_init(void)
> > >
> > > mshv_cpuhp_online = ret;
> > >
> > > + ret = mshv_init_vmm_caps(dev);
> > > + if (ret)
> > > + goto remove_cpu_state;
> > > +
> > > ret = mshv_retrieve_scheduler_type(dev);
> > > if (ret)
> > > goto remove_cpu_state;
> > > @@ -2301,11 +2323,13 @@ static int __init mshv_parent_partition_init(void)
> > > if (ret)
> > > goto remove_cpu_state;
> > >
> > > - mshv_init_vmm_caps(dev);
> > > + ret = root_scheduler_init(dev);
> > > + if (ret)
> > > + goto exit_partition;
> > >
> > > ret = mshv_irqfd_wq_init();
> > > if (ret)
> > > - goto exit_partition;
> > > + goto deinit_root_scheduler;
> > >
> > > spin_lock_init(&mshv_root.pt_ht_lock);
> > > hash_init(mshv_root.pt_htable);
> > > @@ -2314,6 +2338,8 @@ static int __init mshv_parent_partition_init(void)
> > >
> > > return 0;
> > >
> > > +deinit_root_scheduler:
> > > + root_scheduler_deinit();
> > > exit_partition:
> > > if (hv_root_partition())
> > > mshv_root_partition_exit();
> > > @@ -2332,6 +2358,7 @@ static void __exit mshv_parent_partition_exit(void)
> > > mshv_port_table_fini();
> > > misc_deregister(&mshv_dev);
> > > mshv_irqfd_wq_cleanup();
> > > + root_scheduler_deinit();
> > > if (hv_root_partition())
> > > mshv_root_partition_exit();
> > > cpuhp_remove_state(mshv_cpuhp_online);
> > > diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> > > index aa03616f965b..0f7178fa88a8 100644
> > > --- a/include/hyperv/hvhdk_mini.h
> > > +++ b/include/hyperv/hvhdk_mini.h
> > > @@ -87,6 +87,9 @@ enum hv_partition_property_code {
> > > HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS = 0x00010000,
> > > HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES = 0x00010001,
> > >
> > > + /* Integrated scheduling properties */
> > > + HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED = 0x00020005,
> > > +
> > > /* Resource properties */
> > > HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING = 0x00050005,
> > > HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION = 0x00050017,
> > > @@ -102,7 +105,7 @@ enum hv_partition_property_code {
> > > };
> > >
> > > #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT 1
> > > -#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 58
> > > +#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT 57
> > >
> > > struct hv_partition_property_vmm_capabilities {
> > > u16 bank_count;
> > > @@ -120,6 +123,7 @@ struct hv_partition_property_vmm_capabilities {
> > > #endif
> > > u64 assignable_synthetic_proc_features: 1;
> > > u64 tag_hv_message_from_child: 1;
> > > + u64 vmm_enable_integrated_scheduler : 1;
> > > u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
> > > } __packed;
> > > };
> > >
> > >
> >