Message-ID: <aYIW9PhzqmyET8IL@skinsburskii.localdomain>
Date: Tue, 3 Feb 2026 07:40:36 -0800
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
To: Anirudh Rayabharam <anirudh@...rudhrb.com>
Cc: kys@...rosoft.com, haiyangz@...rosoft.com, wei.liu@...nel.org,
decui@...rosoft.com, longli@...rosoft.com,
linux-hyperv@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > > loaded via KEXEC, leading to potential system crashes when the new
> > > > > > > > > > > > kernel accesses hypervisor-deposited pages.
> > > > > > > > > > > >
> > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > > management is implemented.
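
For reference, the mutual exclusion proposed above boils down to a
Kconfig dependency along the following lines (a sketch only; the option
name MSHV_ROOT and the KEXEC_CORE symbol are assumptions, not a quote
from the actual patch):

config MSHV_ROOT
	tristate "Microsoft Hypervisor root partition support"
	depends on HYPERV
	# Deposited pages are never withdrawn, so a kexec'd kernel would
	# otherwise reuse memory the hypervisor still owns.
	depends on !KEXEC_CORE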
> > > > > > > > > > >
> > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > > > > > > and would work without any issue for L1VH.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > No, it won't work, and hypervisor-deposited pages won't be withdrawn.
> > > > > > > > >
> > > > > > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > > > > > right? What other deposited pages would be left?
> > > > > > > > >
> > > > > > > >
> > > > > > > > The driver deposits two types of pages: pages for the guests (withdrawn
> > > > > > > > upon guest shutdown) and pages for the host itself (never withdrawn).
> > > > > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > > > > host partition.
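
To illustrate the difference (a simplified sketch, not the actual driver
code; hv_call_deposit_pages() and hv_current_partition_id do exist, but
the prototype and the wrapper below are assumed):

static int sketch_deposit_contexts(u64 guest_partition_id, u32 nr_pages)
{
	int ret;

	/*
	 * Guest-context deposit: charged against the guest's partition ID
	 * and withdrawn when that partition is torn down.
	 */
	ret = hv_call_deposit_pages(NUMA_NO_NODE, guest_partition_id,
				    nr_pages);
	if (ret)
		return ret;

	/*
	 * Host-context deposit (as in hv_call_create_partition()): charged
	 * against the root partition itself and never withdrawn, so a
	 * kexec'd kernel would see these pages as free RAM while the
	 * hypervisor still owns them.
	 */
	return hv_call_deposit_pages(NUMA_NO_NODE, hv_current_partition_id,
				     nr_pages);
}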
> > > > > > >
> > > > > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > > > > then reclaim memory? Would this help with kernel consistency
> > > > > > > irrespective of userspace behavior?
> > > > > > >
> > > > > >
> > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > >
> > > > > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > > > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > > > > may still crash.
> > > > >
> > > > > Actually guests won't be running by the time we reach our module_exit
> > > > > function during a kexec. Userspace processes would've been killed by
> > > > > then.
> > > > >
> > > >
> > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > We must not rely on the OS to do a graceful shutdown before doing
> > > > kexec.
> > >
> > > I see, kexec -e is too brutal. Something like systemctl kexec is
> > > more graceful and is probably used more commonly. In this case at least
> > > we could register a reboot notifier and attempt to clean things up.
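
A reboot notifier would look roughly like this (sketch only; the cleanup
helper mshv_withdraw_deposited_pages() is hypothetical, and making it
always succeed is the hard part):

#include <linux/notifier.h>
#include <linux/reboot.h>

static int mshv_reboot_notify(struct notifier_block *nb,
			      unsigned long action, void *data)
{
	/*
	 * Tear down remaining partitions and try to withdraw the pages
	 * deposited on behalf of the host before the next kernel boots.
	 */
	if (mshv_withdraw_deposited_pages())
		pr_err("mshv: failed to reclaim deposited pages\n");
	return NOTIFY_DONE;
}

static struct notifier_block mshv_reboot_nb = {
	.notifier_call = mshv_reboot_notify,
};

/* At module init: register_reboot_notifier(&mshv_reboot_nb); */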
> > >
> > > I think it is better to support kexec to this extent rather than
> > > disabling it entirely.
> > >
> >
> > You do understand that once our kernel is released to third parties, we
> > can’t control how they will use kexec, right?
>
> Yes, we can't. But that's okay. It is fine for us to say that only some
> kexec scenarios are supported and some aren't (iff you're creating VMs
> using MSHV; if you're not creating VMs, all of kexec is supported).
>
Well, I disagree here. If we say the kernel supports MSHV, we must
provide a robust solution. A partially working solution is not
acceptable. It makes us look careless and can damage our reputation as a
team (and as a company).
> >
> > This is a valid and existing option. We have to account for it. Yet
> > again, L1VH will be used by arbitrary third parties out there, not just
> > by us.
> >
> > We can’t say the kernel supports MSHV until we close these gaps. We must
>
> We can. It is okay to say some scenarios are supported and some aren't.
>
> All kexecs are supported if they never create VMs using MSHV. If they do
> create VMs using MSHV and we implement cleanup in a reboot notifier, at
> least systemctl kexec and crashdump kexec would work, which are probably
> the most common uses of kexec. It's okay to say that this is all we
> support as of now.
>
I'm repeating myself, but I'll try to put it differently.
There won't be any kernel core collected if a page was deposited: once a
page is allocated and deposited, the crash kernel will try to write it
into the core. You're arguing for a lost cause here.
> Also, what makes you think customers would even be interested in enabling
> our module in their kernel configs if it takes away kexec?
>
It's simple: L1VH isn't a host, so I can spin up new VMs instead of
servicing the existing ones.
Why do you think there won’t be customers interested in using MSHV in
L1VH without kexec support?
Thanks,
Stanislav
> Thanks,
> Anirudh.
>
> > not depend on user space to keep the kernel safe.
> >
> > Do you agree?
> >
> > Thanks,
> > Stanislav
> >
> > > >
> > > > > Also, why is this sloppy? Isn't this what module_exit should be
> > > > > doing anyway? If someone unloads our module we should be trying to
> > > > > clean everything up (including killing guests) and reclaim memory.
> > > > >
> > > >
> > > > Kexec does not unload modules, but it doesn't really matter even if it
> > > > did.
> > > > There are other means to plug into the reboot flow, but none of them
> > > > is robust or reliable.
> > > >
> > > > > In any case, we can BUG() out if we fail to reclaim the memory. That would
> > > > > stop the kexec.
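
In code, the suggestion amounts to something like this (sketch;
mshv_withdraw_deposited_pages() is a hypothetical reclaim helper):

/* Inside the (hypothetical) pre-kexec cleanup path. */
if (mshv_withdraw_deposited_pages())
	BUG();	/* refuse to hand known-bad memory to the next kernel */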
> > > > >
> > > >
> > > > By killing the whole system? This is not a good user experience and I
> > > > don't see how this can be justified.
> > >
> > > It is justified because, as you said, once we reach that failure we can
> > > no longer guarantee integrity. So BUG() makes sense. This BUG() would
> > > cause the system to go for a full reboot and restore integrity.
> > >
> > > >
> > > > > This is a better solution than disabling KEXEC outright: our driver
> > > > > makes the best possible effort to make kexec work.
> > > > >
> > > >
> > > > How is an unreliable feature leading to potential system crashes better
> > > > than disabling kexec outright?
> > >
> > > Because there are ways of using the feature reliably. What if someone
> > > has MSHV_ROOT enabled but never starts a VM? (Just because someone has our
> > > driver enabled in the kernel doesn't mean they're using it.) What about crash
> > > dump?
> > >
> > > It is far better to support some of these scenarios and be unreliable in
> > > some corner cases rather than disabling the feature completely.
> > >
> > > Also, I'm curious if any other driver in the kernel has ever done this
> > > (force disable KEXEC).
> > >
> > > >
> > > > It's the complete opposite for me: the latter provides limited but
> > > > robust functionality, while the former provides unreliable and
> > > > unpredictable behavior.
> > > >
> > > > > >
> > > > > > There are two long-term solutions:
> > > > > > 1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.
> > > > >
> > > > > I honestly think we should focus efforts on making kexec work rather
> > > > > than finding ways to prevent it.
> > > > >
> > > >
> > > > There is no argument about it. But until we have it fixed properly, we
> > > > have two options: either disable kexec or stop claiming we have our
> > > > driver up and ready for external customers. Given the importance of
> > > > this driver for current projects, I believe the better way would be to
> > > > explicitly limit the functionality instead of postponing the
> > > > productization of the driver.
> > >
> > > It is okay to claim our driver as ready even if it doesn't support all
> > > kexec cases. If we can support the common cases such as crash dump and
> > > maybe kexec-based servicing (pretty sure people do systemctl kexec and
> > > not kexec -e for this, with proper teardown), we can claim that our driver
> > > is ready for general use.
> > >
> > > Thanks,
> > > Anirudh.