Message-ID: <wnh3ghsxxml32sldkm4qzlzre7nebor3oqtj6i7mlhqj2gwzys@o5w5rpzrhhc4>
Date: Tue, 3 Feb 2026 10:34:28 +0530
From: Anirudh Rayabharam <anirudh@...rudhrb.com>
To: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
Cc: kys@...rosoft.com, haiyangz@...rosoft.com, wei.liu@...nel.org,
decui@...rosoft.com, longli@...rosoft.com, linux-hyperv@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > loaded via KEXEC, leading to potential system crashes when the new kernel
> > > > > > > > > > > accesses hypervisor-deposited pages.
> > > > > > > > > > >
> > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > management is implemented.
> > > > > > > > > >
> > > > > > > > > > Someone might want to stop all guest VMs and do a kexec, which is valid
> > > > > > > > > > and would work without any issue for L1VH.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > No, it won't work: hypervisor-deposited pages won't be withdrawn.
> > > > > > > >
> > > > > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > > > > right? What other deposited pages would be left?
> > > > > > > >
> > > > > > >
> > > > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > > > upon guest shutdown) and the other for the host itself (never
> > > > > > > withdrawn).
> > > > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > > > host partition.
> > > > > >
> > > > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > > > then reclaim memory? Would this help with kernel consistency
> > > > > > irrespective of userspace behavior?
> > > > > >
> > > > >
> > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > >
> > > > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > > > may still crash.
> > > >
> > > > Actually guests won't be running by the time we reach our module_exit
> > > > function during a kexec. Userspace processes would've been killed by
> > > > then.
> > > >
> > >
> > > No, they will not: "kexec -e" doesn't kill user processes.
> > > We must not rely on the OS to do a graceful shutdown before doing
> > > kexec.
> >
> > I see kexec -e is too brutal. Something like systemctl kexec is
> > more graceful and is probably used more commonly. In this case at least
> > we could register a reboot notifier and attempt to clean things up.
> >
> > I think it is better to support kexec to this extent rather than
> > disabling it entirely.
> >
>
> You do understand that once our kernel is released to third parties, we
> can’t control how they will use kexec, right?
Yes, we can't. But that's okay. It is fine for us to say that only some
kexec scenarios are supported and some aren't (that only applies if you're
creating VMs using MSHV; if you're not creating VMs, all of kexec is
supported).
>
> This is a valid and existing option. We have to account for it. Yet
> again, L1VH will be used by arbitrary third parties out there, not just
> by us.
>
> We can’t say the kernel supports MSHV until we close these gaps. We must
We can. It is okay to say that some scenarios are supported and some aren't.
All kexecs are supported if they never create VMs using MSHV. If they do
create VMs using MSHV and we implement cleanup in a reboot notifier, at
least systemctl kexec and crash-dump kexec would work, which are probably
the most common uses of kexec. It's okay to say that this is all we support
as of now.
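
Roughly, what I have in mind is something like the sketch below. It is only
an illustration of the idea, not a tested patch: mshv_reclaim_root_pages()
is a hypothetical helper that would force-kill any remaining partitions and
withdraw the pages deposited for the root partition; the driver doesn't
have such a helper today.

#include <linux/bug.h>
#include <linux/notifier.h>
#include <linux/reboot.h>

/*
 * Hypothetical helper: tear down any remaining partitions and withdraw
 * the pages deposited on behalf of the root partition. Returns 0 on
 * success. Not implemented in the driver today.
 */
int mshv_reclaim_root_pages(void);

static int mshv_reboot_notify(struct notifier_block *nb,
			      unsigned long action, void *data)
{
	/*
	 * A graceful reboot or kexec (e.g. "systemctl kexec") runs the
	 * reboot notifier chain before jumping into the new kernel, so
	 * this is our chance to clean up.
	 */
	if (mshv_reclaim_root_pages())
		BUG();	/* don't hand unreclaimed pages to the next kernel */

	return NOTIFY_DONE;
}

static struct notifier_block mshv_reboot_nb = {
	.notifier_call = mshv_reboot_notify,
};

/* Registered from the driver's init path. */
static int mshv_register_reboot_hook(void)
{
	return register_reboot_notifier(&mshv_reboot_nb);
}

(Whether BUG() is the right reaction to a failed reclaim is the separate
question we're already debating in this thread.)
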
Also, what makes you think customers would even be interested in enabling
our module in their kernel configs if it takes away kexec?
Thanks,
Anirudh.
> not depend on user space to keep the kernel safe.
>
> Do you agree?
>
> Thanks,
> Stanislav
>
> > >
> > > > Also, why is this sloppy? Isn't this what module_exit should be
> > > > doing anyway? If someone unloads our module we should be trying to
> > > > clean everything up (including killing guests) and reclaim memory.
> > > >
> > >
> > > Kexec does not unload modules, but it wouldn't really matter even if it
> > > did.
> > > There are other means to plug into the reboot flow, but none of them
> > > is robust or reliable.
> > >
> > > > In any case, we can BUG() out if we fail to reclaim the memory. That would
> > > > stop the kexec.
> > > >
> > >
> > > By killing the whole system? This is not a good user experience and I
> > > don't see how this can be justified.
> >
> > It is justified because, as you said, once we reach that failure we can
> > no longer guarantee integrity. So BUG() makes sense. This BUG() would
> > cause the system to go for a full reboot and restore integrity.
> >
> > >
> > > > This is a better solution than disabling KEXEC outright: our
> > > > driver would at least have made a best effort to make kexec work.
> > > >
> > >
> > > How is an unreliable feature leading to potential system crashes better
> > > than disabling kexec outright?
> >
> > Because there are ways of using the feature reliably. What if someone
> > has MSHV_ROOT enabled but never starts a VM? (Just because someone has our
> > driver enabled in the kernel doesn't mean they're using it.) What about crash
> > dump?
> >
> > It is far better to support some of these scenarios and be unreliable in
> > some corner cases than to disable the feature completely.
> >
> > Also, I'm curious if any other driver in the kernel has ever done this
> > (force disable KEXEC).
> >
> > >
> > > It's the complete opposite story for me: the latter provides limited
> > > but robust functionality, while the former provides unreliable and
> > > unpredictable behavior.
> > >
> > > > >
> > > > > There are two long-term solutions:
> > > > > 1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.
> > > >
> > > > I honestly think we should focus efforts on making kexec work rather
> > > > than finding ways to prevent it.
> > > >
> > >
> > > There is no argument about it. But until we have it fixed properly, we
> > > have two options: either disable kexec or stop claiming we have our
> > > driver up and ready for external customers. Given the importance of
> > > this driver for current projects, I believe the better way would be to
> > > explicitly limit the functionality instead of postponing the
> > > productization of the driver.
> >
> > It is okay to claim that our driver is ready even if it doesn't support
> > all kexec cases. If we can support the common cases such as crash dump and
> > maybe kexec-based servicing (pretty sure people do systemctl kexec with a
> > proper teardown rather than kexec -e for this), we can claim that our
> > driver is ready for general use.
> >
> > Thanks,
> > Anirudh.