Message-ID: <aYTPh5oUBu-OWPlx@skinsburskii.localdomain>
Date: Thu, 5 Feb 2026 09:12:39 -0800
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
To: Anirudh Rayabharam <anirudh@...rudhrb.com>
Cc: kys@...rosoft.com, haiyangz@...rosoft.com, wei.liu@...nel.org,
decui@...rosoft.com, longli@...rosoft.com,
linux-hyperv@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
On Thu, Feb 05, 2026 at 04:59:35AM +0000, Anirudh Rayabharam wrote:
> On Wed, Feb 04, 2026 at 10:33:11AM -0800, Stanislav Kinsburskii wrote:
> > On Wed, Feb 04, 2026 at 05:33:29AM +0000, Anirudh Rayabharam wrote:
> > > On Tue, Feb 03, 2026 at 11:42:58AM -0800, Stanislav Kinsburskii wrote:
> > > > On Tue, Feb 03, 2026 at 04:46:03PM +0000, Anirudh Rayabharam wrote:
> > > > > On Tue, Feb 03, 2026 at 07:40:36AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> > > > > > > On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > > > > > > > > loaded via KEXEC, leading to potential system crashes when the new kernel
> > > > > > > > > > > > > > > > > > accesses hypervisor-deposited pages.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > > > > > > > > management is implemented.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec. That is valid
> > > > > > > > > > > > > > > > > and would work without any issue for L1VH.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > No, it won't work, and hypervisor-deposited pages won't be withdrawn.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > > > > > > > > > > > with the guest partition ID) would be withdrawn when you kill the VMs,
> > > > > > > > > > > > > > > right? What other deposited pages would be left?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > > > > > > > > > > upon guest shutdown) and the other for the host itself (never
> > > > > > > > > > > > > > withdrawn).
> > > > > > > > > > > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > > > > > > > > > > host partition.
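> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For illustration only, the call pattern is roughly the following (a
> > > > > > > > > > > > > > simplified sketch with a made-up wrapper name, not the exact driver
> > > > > > > > > > > > > > code; it assumes the hv_call_deposit_pages() helper and
> > > > > > > > > > > > > > hv_current_partition_id declared via asm/mshyperv.h):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > #include <linux/numa.h>
> > > > > > > > > > > > > > #include <asm/mshyperv.h>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > /* deposit num_pages to the hypervisor on behalf of the root (host)
> > > > > > > > > > > > > >  * partition; nothing in the driver ever withdraws these pages */
> > > > > > > > > > > > > > static int mshv_deposit_for_root(u32 num_pages)
> > > > > > > > > > > > > > {
> > > > > > > > > > > > > >         return hv_call_deposit_pages(NUMA_NO_NODE,
> > > > > > > > > > > > > >                                      hv_current_partition_id,
> > > > > > > > > > > > > >                                      num_pages);
> > > > > > > > > > > > > > }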
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > > > > > > > > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > > > > > > > > > > then reclaim memory? Would this help with kernel consistency
> > > > > > > > > > > > > irrespective of userspace behavior?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > > > > > > > >
> > > > > > > > > > > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > > > > > > > > > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > > > > > > > > > > may still crash.
> > > > > > > > > > >
> > > > > > > > > > > Actually guests won't be running by the time we reach our module_exit
> > > > > > > > > > > function during a kexec. Userspace processes would've been killed by
> > > > > > > > > > > then.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > > > > > > > We must not rely on the OS to do a graceful shutdown before doing
> > > > > > > > > > kexec.
> > > > > > > > >
> > > > > > > > > I see; "kexec -e" is too brutal. Something like "systemctl kexec" is
> > > > > > > > > more graceful and is probably used more commonly. In this case at least
> > > > > > > > > we could register a reboot notifier and attempt to clean things up.
> > > > > > > > >
> > > > > > > > > I think it is better to support kexec to this extent rather than
> > > > > > > > > disabling it entirely.
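> > > > > > > > >
> > > > > > > > > Something along these lines, for example (untested sketch;
> > > > > > > > > mshv_cleanup_all() is a placeholder for whatever cleanup we end up
> > > > > > > > > implementing):
> > > > > > > > >
> > > > > > > > > #include <linux/notifier.h>
> > > > > > > > > #include <linux/reboot.h>
> > > > > > > > >
> > > > > > > > > /* placeholder: destroy partitions and withdraw deposited pages */
> > > > > > > > > static void mshv_cleanup_all(void);
> > > > > > > > >
> > > > > > > > > static int mshv_reboot_notify(struct notifier_block *nb,
> > > > > > > > >                               unsigned long action, void *data)
> > > > > > > > > {
> > > > > > > > >         mshv_cleanup_all();     /* best effort before kexec/reboot */
> > > > > > > > >         return NOTIFY_DONE;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > static struct notifier_block mshv_reboot_nb = {
> > > > > > > > >         .notifier_call = mshv_reboot_notify,
> > > > > > > > > };
> > > > > > > > >
> > > > > > > > > /* in module init: register_reboot_notifier(&mshv_reboot_nb); */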
> > > > > > > > >
> > > > > > > >
> > > > > > > > You do understand that once our kernel is released to third parties, we
> > > > > > > > can’t control how they will use kexec, right?
> > > > > > >
> > > > > > > Yes, we can't. But that's okay. It is fine for us to say that only some
> > > > > > > kexec scenarios are supported and some aren't (iff you're creating VMs
> > > > > > > using MSHV; if you're not creating VMs all of kexec is supported).
> > > > > > >
> > > > > >
> > > > > > Well, I disagree here. If we say the kernel supports MSHV, we must
> > > > > > provide a robust solution. A partially working solution is not
> > > > > > acceptable. It makes us look careless and can damage our reputation as a
> > > > > > team (and as a company).
> > > > >
> > > > > It won't if we call out upfront what is supported and what is not.
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > This is a valid and existing option. We have to account for it. Yet
> > > > > > > > again, L1VH will be used by arbitrary third parties out there, not just
> > > > > > > > by us.
> > > > > > > >
> > > > > > > > We can’t say the kernel supports MSHV until we close these gaps. We must
> > > > > > >
> > > > > > > We can. It is okay to say that some scenarios are supported and some aren't.
> > > > > > >
> > > > > > > All kexecs are supported if they never create VMs using MSHV. If they do
> > > > > > > create VMs using MSHV and we implement cleanup in a reboot notifier, at
> > > > > > > least systemctl kexec and crashdump kexec would work, which are probably
> > > > > > > the most common uses of kexec. It's okay to say that this is all we
> > > > > > > support as of now.
> > > > > > >
> > > > > >
> > > > > > I'm repeating myself, but I'll try to put it differently.
> > > > > > There won't be any kernel core collected if a page was deposited. You're
> > > > > > arguing for a lost cause here. Once a page is allocated and deposited,
> > > > > > the crash kernel will try to write it into the core.
> > > > >
> > > > > That's why we have to implement something where we attempt to destroy
> > > > > partitions and reclaim memory (and BUG() out if that fails; which
> > > > > hopefully should happen very rarely if at all). This should be *the*
> > > > > solution we work towards. We don't need a temporary disable kexec
> > > > > solution.
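> > > > >
> > > > > Roughly what I have in mind (sketch only; the two helpers are
> > > > > placeholders for functionality we'd have to add, and the exit-function
> > > > > name is illustrative):
> > > > >
> > > > > #include <linux/bug.h>
> > > > > #include <linux/module.h>
> > > > >
> > > > > /* placeholders, not in the driver today */
> > > > > static int mshv_destroy_all_partitions(void);
> > > > > static int mshv_withdraw_root_pages(void);
> > > > >
> > > > > static void __exit mshv_root_exit(void)
> > > > > {
> > > > >         /* if either step fails, the next kernel would inherit pages
> > > > >          * it cannot safely touch, so don't limp along */
> > > > >         if (mshv_destroy_all_partitions() ||
> > > > >             mshv_withdraw_root_pages())
> > > > >                 BUG();
> > > > > }
> > > > > module_exit(mshv_root_exit);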
> > > > >
> > > >
> > > > No, the solution is to preserve the shared state and pass it over to the
> > > > new kernel via KHO (Kexec HandOver).
> > >
> > > Okay, then work towards it without doing a temporary KEXEC disable. We can
> > > call out that kexec is not supported until then. Disabling KEXEC is too
> > > intrusive.
> > >
> >
> > What do you mean by "too intrusive"? The change is local to the driver's
> > Kconfig. There are no verbal "callouts" in upstream Linux - that's
> > exactly what Kconfig is used for. Once the proper solution is
> > implemented, we can remove the restriction.
> >
> > > Is there any precedent for this? Do you know if any driver ever disabled
> > > KEXEC this way?
> > >
> >
> > No, but there is no other driver similar to this one.
>
> Doesn't have to be like this one. There could be issues with device
> state during kexec.
>
> > Why does it matter though?
>
> To learn from past precedents.
>
> >
> > > >
> > > > > >
> > > > > > > Also, what makes you think customers would even be interested in enabling
> > > > > > > our module in their kernel configs if it takes away kexec?
> > > > > > >
> > > > > >
> > > > > > It's simple: L1VH isn't a host, so I can spin up new VMs instead of
> > > > > > servicing the existing ones.
> > > > >
> > > > > And what about the L2 VM state then? They might not be throwaway in all
> > > > > cases.
> > > > >
> > > >
> > > > An L2 guest can (and likely will) be migrated from the old L1VH to the
> > > > new one.
> > > > And this is most likely the scenario customers are using today.
> > > >
> > > > > >
> > > > > > Why do you think there won’t be customers interested in using MSHV in
> > > > > > L1VH without kexec support?
> > > > >
> > > > > Because they could already be using kexec for their servicing needs or
> > > > > whatever. And no, we can't just say "don't service these VMs, just spin up
> > > > > new ones".
> > > > >
> > > >
> > > > Are you speculating, or do you know for sure?
> > >
> > > It's a reasonable assumption that people are using kexec for servicing.
> > >
> >
> > Again, using kexec for servicing is not supported: why pretend it is?
>
> What this patch effectively asserts is that kexec is unsupported whenever the
> MSHV driver is enabled. But that is not accurate. Enabling MSHV does not
> necessarily imply that it is being used. The correct statement is that kexec is
> unsupported only when MSHV is *in use*, i.e. when one or more VMs are
> running.
>
> By disabling kexec unconditionally, the patch prevents a valid workflow in
> situations where no VMs exist and kexec would work without issue. This imposes a
> blanket restriction instead of enforcing the actual requirement.
>
> And sure, I understand there is no way to enforce that actual
> requirement. So this is what I propose:
>
> The statement "kexec is not supported when the MSHV driver is used" can be
> documented on docs.microsoft.com once direct virtualization becomes broadly
> available. The documentation can also provide operational guidance, such as
> shutting down all VMs before invoking kexec for servicing. This preserves a
> practical path for users who rely on kexec. If kexec is disabled entirely, that
> flexibility is lost.
>
> The stricter approach ensures users cannot accidentally make a mistake, which
> has its merits. However, my approach gives more power and discretion to
> the user. In parallel, we of course continue to work on making it
> robust.
>
The flexibility is much smaller than you described. The host can’t kexec
if a VM was ever created, because we don’t withdraw the host pages.
Even if we try to withdraw pages during kexec, it won’t help with crash
collection. Those pages will be in use and won’t be available to
withdraw.
So the trade-off is between being able to kexec safely only before any
VM has been launched and blocking it completely.
> >
> > > >
> > > > > Also, keep in mind that once L1VH is available in Azure, the distros
> > > > > that run on it would be the same distros that run on all other Azure
> > > > > VMs. There won't be special distros with a kernel specifically built for
> > > > > L1VH. And KEXEC is generally enabled in distros. Distro vendors won't be
> > > > > happy that they would need to publish a separate version of their image with
> > > > > MSHV_ROOT enabled and KEXEC disabled because they wouldn't want KEXEC to
> > > > > be disabled for all Azure VMs. Also, customers will be confused as to why
> > > > > the same distro doesn't work on L1VH.
> > > > >
> > > >
> > > > I don't think distro happiness is our concern. They already build custom
> > >
> > > If distros are not happy they won't package this and consequently
> > > nobody will use it.
> > >
> >
> > Could you provide an example of such issues in the past?
> >
> > > > versions for Azure. They can build another custom version for L1VH if
> > > > needed.
> > >
> > > We should at least check if they are ready to do this.
> > >
> >
> > This is a labor-intensive, long-term check. Unless there is solid
> > evidence that they won't do it, I don't see the point in doing this.
>
> It is reasonable to assume that maintaining an additional flavor of a
> distro is an overhead (maintain new package(s), maintain Azure
> marketplace images, etc.). This should be enough reason to check. Not
> everything needs solid evidence. Oftentimes a reasonable suspicion
> will do.
>
There will be a new kernel flavor anyway. That means a new kernel
package. If we also need a separate distro image for MSHV on Azure VMs,
it will be needed regardless of kexec support. There won’t be a generic
Ubuntu build that works both for regular guest VMs and for L1VH VMs any
time soon.

Thanks,
Stanislav
> Thanks,
> Anirudh.
>
> >
> > Thanks,
> > Stanislav
> >
> > > Thanks,
> > > Anirudh.
> > >
> > > >
> > > > Anyway, I don't see the point in continuing this discussion. All points
> > > > have been made, and solutions have been proposed.
> > > >
> > > > If you can come up with something better in the next few days, so we at
> > > > least have a chance to get it merged in the next merge window, great. If
> > > > not, we should explicitly forbid the unsupported feature and move on.
> > > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > > Thanks,
> > > > > Anirudh.