Message-ID: <aXuwes2HNf4Og8lW@skinsburskii.localdomain>
Date: Thu, 29 Jan 2026 11:09:46 -0800
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>
Cc: "kys@...rosoft.com" <kys@...rosoft.com>,
	"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
	"wei.liu@...nel.org" <wei.liu@...nel.org>,
	"decui@...rosoft.com" <decui@...rosoft.com>,
	"longli@...rosoft.com" <longli@...rosoft.com>,
	"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] mshv: Add support for integrated scheduler

On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > 
> > From: Andreea Pintilie <anpintil@...rosoft.com>
> > 
> > Query the hypervisor for integrated scheduler support and use it if
> > configured.
> > 
> > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > root scheduler allows the root partition to schedule guest vCPUs across
> > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > scheduling entirely to the hypervisor.
> > 
> > Direct virtualization introduces a new privileged guest partition type, the
> > L1 Virtual Host (L1VH), which can create child partitions from its own
> > resources. These child partitions are effectively siblings, scheduled by
> > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > CFS, and cpuset controllers can still be used, their effectiveness is
> > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > (typically round-robin across all allocated physical CPUs). As a result,
> > the system may appear to "steal" time from the L1VH and its children.
> > 
> > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > This allows an L1VH partition to schedule its own vCPUs and those of its
> > guests across its "physical" cores, effectively emulating root scheduler
> > behavior within the L1VH, while retaining core scheduler behavior for the
> > rest of the system.
> > 
> > The integrated scheduler is controlled by the root partition and gated by
> > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > supports the integrated scheduler. The L1VH partition must then check if it
> > is enabled by querying the corresponding extended partition property. If
> > this property is true, the L1VH partition must use the root scheduler
> > logic; otherwise, it must use the core scheduler.
> > 
> > Signed-off-by: Andreea Pintilie <anpintil@...rosoft.com>
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_main.c |   79 +++++++++++++++++++++++++++++--------------
> >  include/hyperv/hvhdk_mini.h |    6 +++
> >  2 files changed, 58 insertions(+), 27 deletions(-)
> > 

<snip>

> >  static int __init mshv_root_partition_init(struct device *dev)
> >  {
> >  	int err;
> > 
> > -	err = root_scheduler_init(dev);
> > -	if (err)
> > -		return err;
> > -
> >  	err = register_reboot_notifier(&mshv_reboot_nb);
> >  	if (err)
> > -		goto root_sched_deinit;
> > +		return err;
> > 
> >  	return 0;
> 
> This code is now:
> 
> 	if (err)
> 		return err;
> 	return 0;
> 
> which can be simplified to just:
> 
> 	return err;
> 
> Or drop the local variable 'err' and simplify the entire function to:
> 
> 	return register_reboot_notifier(&mshv_reboot_nb);
> 
> There's a tangential question here: Why is this reboot notifier
> needed in the first place? All it does is remove the cpuhp state
> that allocates/frees the per-cpu root_scheduler_input and
> root_scheduler_output pages. Removing the state will free
> the pages, but if Linux is rebooting, why bother?
> 

This was originally done to support kexec.
Here is the original commit message:

    mshv: perform synic cleanup during kexec

    Register a reboot notifier that performs synic cleanup when a kexec
    is in progress.

    One notable issue this commit fixes is one where after a kexec, virtio
    devices are not functional. Linux root partition receives MMIO doorbell
    events in the ring buffer in the SIRB synic page. The hypervisor maintains
    a head pointer where it writes new events into the ring buffer. The root
    partition maintains a tail pointer to read events from the buffer.

    Upon kexec reboot, all root data structures are re-initialized and thus the
    tail pointer gets reset to zero. The hypervisor on the other hand still
    retains the pre-kexec head pointer which could be non-zero. This means that
    when the hypervisor writes new events to the ring buffer, the root
    partition looks at the wrong place and doesn't find any events. So, future
    doorbell events never get delivered. As a result, virtqueue kicks never get
    delivered to the host.

    When the SIRB page is disabled the hypervisor resets the head pointer.
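
For reference, the general shape of such a kexec-aware notifier is
roughly the following. This is only a sketch: the kexec_in_progress
check follows the original commit message, the cpuhp_remove_state()
call follows your description of what the callback does today, and the
callback name is an assumption.

        #include <linux/cpuhotplug.h>
        #include <linux/kexec.h>
        #include <linux/notifier.h>
        #include <linux/reboot.h>

        /*
         * Sketch only: on a kexec reboot, remove the cpuhp state so the
         * per-cpu pages are torn down; per the original commit the same
         * hook did the synic cleanup (disabling the SIRB page makes the
         * hypervisor reset its head pointer).
         */
        static int mshv_reboot_notify(struct notifier_block *nb,
                                      unsigned long action, void *data)
        {
                if (kexec_in_progress)
                        cpuhp_remove_state(mshv_cpuhp_online);

                return NOTIFY_DONE;
        }

        static struct notifier_block mshv_reboot_nb = {
                .notifier_call = mshv_reboot_notify,
        };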

> > -root_sched_deinit:
> > -	root_scheduler_deinit();
> > -	return err;
> >  }
> > 
> > -static void mshv_init_vmm_caps(struct device *dev)
> > +static int mshv_init_vmm_caps(struct device *dev)
> >  {
> > -	/*
> > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > -	 */
> > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > -					      0, &mshv_root.vmm_caps,
> > -					      sizeof(mshv_root.vmm_caps)))
> > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > +	int ret;
> > +
> > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > +					 	HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > +						0, &mshv_root.vmm_caps,
> > +						sizeof(mshv_root.vmm_caps));
> > +	if (ret) {
> > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > +		return ret;
> > +	}
> 
> This is a functional change that isn't mentioned in the commit message.
> Why is it now appropriate to fail instead of treating the VMM capabilities
> as all disabled? Presumably there are older versions of the hypervisor that
> don't support the requirements described in the original comment, but
> perhaps they are no longer relevant?
> 

Failing is now the only option for the L1VH partition: it must discover
the scheduler type, and without this information it cannot operate. The
core scheduler logic will not work with the integrated scheduler, and
vice versa.

And yes, older hypervisor versions do not support L1VH.
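
Roughly, the discovery flow looks like this. A sketch only: the helper
name and the direct access to the capability bit are assumptions, while
the property and capability identifiers are the ones from the patch.

        /* Hypothetical helper illustrating the discovery described above. */
        static int mshv_detect_integrated_scheduler(struct device *dev)
        {
                u64 enabled = 0;
                int ret;

                /* Capability bit: does the hypervisor support it at all? */
                if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
                        return 0;       /* plain core scheduler behavior */

                /* Extended partition property: is it enabled for this L1VH? */
                ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
                                HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED,
                                0, &enabled, sizeof(enabled));
                if (ret)
                        return ret;     /* can't pick a scheduler, must fail */

                dev_dbg(dev, "integrated scheduler %s: using %s scheduler logic\n",
                        enabled ? "enabled" : "disabled",
                        enabled ? "root" : "core");
                return 0;
        }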

Thanks,
Stanislav

> > 
> >  	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
> > +
> > +	return 0;
> >  }
> > 
> >  static int __init mshv_parent_partition_init(void)
> > @@ -2292,6 +2310,10 @@ static int __init mshv_parent_partition_init(void)
> > 
> >  	mshv_cpuhp_online = ret;
> > 
> > +	ret = mshv_init_vmm_caps(dev);
> > +	if (ret)
> > +		goto remove_cpu_state;
> > +
> >  	ret = mshv_retrieve_scheduler_type(dev);
> >  	if (ret)
> >  		goto remove_cpu_state;
> > @@ -2301,11 +2323,13 @@ static int __init mshv_parent_partition_init(void)
> >  	if (ret)
> >  		goto remove_cpu_state;
> > 
> > -	mshv_init_vmm_caps(dev);
> > +	ret = root_scheduler_init(dev);
> > +	if (ret)
> > +		goto exit_partition;
> > 
> >  	ret = mshv_irqfd_wq_init();
> >  	if (ret)
> > -		goto exit_partition;
> > +		goto deinit_root_scheduler;
> > 
> >  	spin_lock_init(&mshv_root.pt_ht_lock);
> >  	hash_init(mshv_root.pt_htable);
> > @@ -2314,6 +2338,8 @@ static int __init mshv_parent_partition_init(void)
> > 
> >  	return 0;
> > 
> > +deinit_root_scheduler:
> > +	root_scheduler_deinit();
> >  exit_partition:
> >  	if (hv_root_partition())
> >  		mshv_root_partition_exit();
> > @@ -2332,6 +2358,7 @@ static void __exit mshv_parent_partition_exit(void)
> >  	mshv_port_table_fini();
> >  	misc_deregister(&mshv_dev);
> >  	mshv_irqfd_wq_cleanup();
> > +	root_scheduler_deinit();
> >  	if (hv_root_partition())
> >  		mshv_root_partition_exit();
> >  	cpuhp_remove_state(mshv_cpuhp_online);
> > diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> > index aa03616f965b..0f7178fa88a8 100644
> > --- a/include/hyperv/hvhdk_mini.h
> > +++ b/include/hyperv/hvhdk_mini.h
> > @@ -87,6 +87,9 @@ enum hv_partition_property_code {
> >  	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
> >  	HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES		= 0x00010001,
> > 
> > +	/* Integrated scheduling properties */
> > +	HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED	= 0x00020005,
> > +
> >  	/* Resource properties */
> >  	HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING		= 0x00050005,
> >  	HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION		= 0x00050017,
> > @@ -102,7 +105,7 @@ enum hv_partition_property_code {
> >  };
> > 
> >  #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT		1
> > -#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	58
> > +#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	57
> > 
> >  struct hv_partition_property_vmm_capabilities {
> >  	u16 bank_count;
> > @@ -120,6 +123,7 @@ struct hv_partition_property_vmm_capabilities {
> >  #endif
> >  			u64 assignable_synthetic_proc_features: 1;
> >  			u64 tag_hv_message_from_child: 1;
> > +			u64 vmm_enable_integrated_scheduler : 1;
> >  			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
> >  		} __packed;
> >  	};
> > 
> > 
> 
