linux-kernel - RE: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <SN6PR02MB4157CAE4FA74E482A96471B1D416A@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Thu, 18 Sep 2025 23:52:35 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Mukesh R <mrathor@...ux.microsoft.com>, "linux-hyperv@...r.kernel.org"
	<linux-hyperv@...r.kernel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-arch@...r.kernel.org"
	<linux-arch@...r.kernel.org>
CC: "kys@...rosoft.com" <kys@...rosoft.com>, "haiyangz@...rosoft.com"
	<haiyangz@...rosoft.com>, "wei.liu@...nel.org" <wei.liu@...nel.org>,
	"decui@...rosoft.com" <decui@...rosoft.com>, "tglx@...utronix.de"
	<tglx@...utronix.de>, "mingo@...hat.com" <mingo@...hat.com>, "bp@...en8.de"
	<bp@...en8.de>, "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
	"x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
	"arnd@...db.de" <arnd@...db.de>
Subject: RE: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition
 from hypervisor

From: Mukesh R <mrathor@...ux.microsoft.com> Sent: Tuesday, September 16, 2025 2:31 PM
> 
> On 9/15/25 10:55, Michael Kelley wrote:
> > From: Mukesh Rathor <mrathor@...ux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> >>
> >> Introduce a small asm stub to transition from the hypervisor to linux
> >
> > I'd argue for capitalizing "Linux" here and in other places in commit
> > text and code comments throughout this patch set.
> 
> I'd argue against it. A quick grep indicates it is a common practice,
> and in the code world goes easy on the eyes :).

I'll offer a final comment on this topic, and then let it be. There's
a history of Greg K-H, Marc Zyngier, Boris Petkov, Sean Christopherson,
and other maintainers giving comments to use the capitalized form
of "Linux", "MSR", "RAM", etc. See:

https://lore.kernel.org/lkml/Y+4WHGNdWTZ5Hc6Y@kroah.com/
https://lore.kernel.org/lkml/86o7u0dqzj.wl-maz@kernel.org/
https://lore.kernel.org/lkml/408e68d0-1ae1-6d56-d008-61de14214326@linaro.org/
https://lore.kernel.org/lkml/20250819215304.GMaKTyQBWi6YzqZ0bW@fat_crate.local/
https://lore.kernel.org/lkml/Y0CAHch5UR2Lp0tU@google.com/
https://lore.kernel.org/lkml/20240126214336.GA453589@bhelgaas/
https://lore.kernel.org/lkml/20161117155543.vg3domfqm3bhp4f7@pd.tnic/

> 
> >> upon devirtualization.
> >
> > In this patch and subsequent patches, you've used the phrase "upon
> > devirtualization", which seems a little vague to me. Does this mean
> > "when devirtualization is complete" or perhaps "when the hypervisor
> > completes devirtualization"? Since there's no spec on any of this,
> > being as precise as possible will help future readers.
> 
> since control comes back to linux at the callback here, i fail to
> understand what is vague about it. when hyp completes devirt,
> devirt is complete.

To me, the word "upon" is less precise than just "after".  In temporal
contexts, "upon" might mean "at the same time as" or it might mean
"immediately after". I wrote this comment as I was trying to figure out
how the entire devirtualization process works. Eventually it became clear
and the ambiguity was resolved, but initially I was uncertain. See some
broader thoughts in my reply on Patch 5 of the series.

> 
> >>
> >> At a high level, during panic of either the hypervisor or the dom0 (aka
> >> root), the nmi handler asks hypervisor to devirtualize.
> >
> > Suggest:
> >
> > At a high level, during panic of either the hypervisor or Linux running
> > in dom0 (a.k.a. the root partition), the Linux NMI handler asks the
> > hypervisor to devirtualize.
> >
> >> As part of that,
> >> the arguments include an entry point to return back to linux. This asm
> >> stub implements that entry point.
> >>
> >> The stub is entered in protected mode, uses temporary gdt and page table
> >> to enable long mode and get to kernel entry point which then restores full
> >> kernel context to resume execution to kexec.
> >>
> >> Signed-off-by: Mukesh Rathor <mrathor@...ux.microsoft.com>
> >> ---
> >>  arch/x86/hyperv/hv_trampoline.S | 105 ++++++++++++++++++++++++++++++++
> >>  1 file changed, 105 insertions(+)
> >>  create mode 100644 arch/x86/hyperv/hv_trampoline.S
> >>
> >> diff --git a/arch/x86/hyperv/hv_trampoline.S b/arch/x86/hyperv/hv_trampoline.S
> >> new file mode 100644
> >> index 000000000000..27a755401a42
> >> --- /dev/null
> >> +++ b/arch/x86/hyperv/hv_trampoline.S
> >> @@ -0,0 +1,105 @@
> >> +/* SPDX-License-Identifier: GPL-2.0-only */
> >> +/*
> >> + * X86 specific Hyper-V kdump/crash related code.
> >
> > Add a qualification that this is for root partition only, and not for
> > general guests?
> 
> i don't think it is needed, it would be odd for guests to collect hyp
> core. besides makefile/kconfig shows this is root vm only
> 
> >> + *
> >> + * Copyright (C) 2025, Microsoft, Inc.
> >> + *
> >> + */
> >> +#include <linux/linkage.h>
> >> +#include <asm/alternative.h>
> >> +#include <asm/msr.h>
> >> +#include <asm/processor-flags.h>
> >> +#include <asm/nospec-branch.h>
> >> +
> >> +/*
> >> + * void noreturn hv_crash_asm32(arg1)
> >> + *    arg1 == edi == 32bit PA of struct hv_crash_trdata
> >
> > I think this is "struct hv_crash_tramp_data".
> 
> correct
> 
> >> + *
> >> + * The hypervisor jumps here upon devirtualization in protected mode. This
> >> + * code gets copied to a page in the low 4G ie, 32bit space so it can run
> >> + * in the protected mode. Hence we cannot use any compile/link time offsets or
> >> + * addresses. It restores long mode via temporary gdt and page tables and
> >> + * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry.
> >> + *
> >> + * PreCondition (ie, Hypervisor call back ABI):
> >> + *  o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled
> >> + *  o CR4 is set to 0x0
> >> + *  o IA32_EFER is set to 0x901 (SCE and NXE are set)
> >> + *  o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX.
> >> + *  o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit 0xFFFF
> >> + *  o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF
> >> + *  o LDTR is initialized as invalid (limit of 0)
> >> + *  o MSR PAT is power on default.
> >> + *  o Other state/registers are cleared. All TLBs flushed.
> >
> > Clarification about "Other state/registers are cleared":  What about
> > processor features that Linux may have enabled or disabled during its
> > initial boot? Are those still in the states Linux set? Or are they reset to
> > power-on defaults? For example, if Linux enabled x2apic, is x2apic
> > still enabled when the stub is entered?
> 
> correct, if linux set x2apic, x2apic would still be enabled.
> 
> >> + *
> >> + * See Intel SDM 10.8.5
> >
> > Hmmm. I downloaded the latest combined SDM, and section 10.8.5
> > in Volume 3A is about Microcode Update Resources, which doesn't
> > seem applicable here. Other volumes don't have a section 10.8.5.
> 
> google ai found it right away upon searching: intel sdm 10.8.5 ia-32e

Unfortunately, Intel doesn't necessarily maintain the section numbering
across revisions of the SDM. This web page:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

has a link to download the "Combined Volume Set", and currently provides
the version dated June 2025. The section "Initializing IA-32e Mode" is
numbered 11.8.5. The December 2024 version has the same 11.8.5
numbering. Are you finding an older version?

Presumably the section title is less likely to change unless Intel does a
major rewrite. So something like this would be more durable:

* See Intel SDM Volume 3A section "Initializing IA-32e Mode" (numbered
11.8.5 in the June 2025 version)

> 
> >> + */
> >> +
> >> +#define HV_CRASHDATA_OFFS_TRAMPCR3    0x0    /*	 0 */
> >> +#define HV_CRASHDATA_OFFS_KERNCR3     0x8    /*	 8 */
> >> +#define HV_CRASHDATA_OFFS_GDTRLIMIT  0x12    /* 18 */
> >> +#define HV_CRASHDATA_OFFS_CS_JMPTGT  0x28    /* 40 */
> >> +#define HV_CRASHDATA_OFFS_C_entry    0x30    /* 48 */
> >
> > It seems like these offsets should go in a #include file along
> > with the definition of struct hv_crash_tramp_data. Then the
> > BUILD_BUG_ON() calls in hv_crash_setup_trampdata() could
> > check against these symbolic names instead of hardcoding
> > numbers that must match these.
> 
> yeah, that works too and was the first cut. but given the small
> number of these, and that they are not used/needed anywhere else,
> and that they will almost never change, creating another tiny header
> in a non-driver directory didn't seem worth it.. but i could go
> either way.
> 
> >> +#define HV_CRASHDATA_TRAMPOLINE_CS    0x8
> >
> > This #define isn't used anywhere.
> 
> removed
> 
> >> +
> >> +	.text
> >> +	.code32
> >> +
> >> +SYM_CODE_START(hv_crash_asm32)
> >> +	UNWIND_HINT_UNDEFINED
> >> +	ANNOTATE_NOENDBR
> >
> > No ENDBR here, presumably because this function is entered via other
> > than an indirect CALL or JMP instruction. Right?
> >
> >> +	movl	$X86_CR4_PAE, %ecx
> >> +	movl	%ecx, %cr4
> >> +
> >> +	movl %edi, %ebx
> >> +	add $HV_CRASHDATA_OFFS_TRAMPCR3, %ebx
> >> +	movl %cs:(%ebx), %eax
> >> +	movl %eax, %cr3
> >> +
> >> +	# Setup EFER for long mode now.
> >> +	movl	$MSR_EFER, %ecx
> >> +	rdmsr
> >> +	btsl	$_EFER_LME, %eax
> >> +	wrmsr
> >> +
> >> +	# Turn paging on using the temp 32bit trampoline page table.
> >> +	movl %cr0, %eax
> >> +	orl $(X86_CR0_PG), %eax
> >> +	movl %eax, %cr0
> >> +
> >> +	/* since kernel cr3 could be above 4G, we need to be in the long mode
> >> +	 * before we can load 64bits of the kernel cr3. We use a temp gdt for
> >> +	 * that with CS.L=1 and CS.D=0 */
> >> +	mov %edi, %eax
> >> +	add $HV_CRASHDATA_OFFS_GDTRLIMIT, %eax
> >> +	lgdtl %cs:(%eax)
> >> +
> >> +	/* not done yet, restore CS now to switch to CS.L=1 */
> >> +	mov %edi, %eax
> >> +	add $HV_CRASHDATA_OFFS_CS_JMPTGT, %eax
> >> +	ljmp %cs:*(%eax)
> >> +SYM_CODE_END(hv_crash_asm32)
> >> +
> >> +	/* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */
> >> +	.code64
> >> +	.balign 8
> >> +SYM_CODE_START(hv_crash_asm64)
> >> +	UNWIND_HINT_UNDEFINED
> >> +	ANNOTATE_NOENDBR
> >
> > But this *is* entered via an indirect JMP, right? So back to my
> > earlier question about the state of processor feature enablement.
> > If Linux enabled IBT, is it still enabled after devirtualization and
> > the hypervisor invokes this entry point? Linux guests on Hyper-V
> > have historically not enabled IBT, but patches that enable it are
> > now in linux-next, and will go into the 6.18 kernel. So maybe
> > this needs an ENDBR64.
> 
> IBT would be disabled in the transition here.... so doesn't really
> matter. ENDBR ok too..

So does Hyper-V explicitly disable IBT before making the callback?
Or is the IBT disabling somehow a processor side effect of going back
to protected mode? I don't see anything in the SDM about the latter.
Not having a Hyper-V spec for all this is frustrating ...

Doing the ENDBR64 here might be safer in the long run in case
we ever do end up here with IBT enabled.

> 
> >> +SYM_INNER_LABEL(hv_crash_asm64_lbl, SYM_L_GLOBAL)
> >> +	/* restore kernel page tables so we can jump to kernel code */
> >> +	mov %edi, %eax
> >> +	add $HV_CRASHDATA_OFFS_KERNCR3, %eax
> >> +	movq %cs:(%eax), %rbx
> >> +	movq %rbx, %cr3
> >> +
> >> +	mov %edi, %eax
> >> +	add $HV_CRASHDATA_OFFS_C_entry, %eax
> >> +	movq %cs:(%eax), %rbx
> >> +	ANNOTATE_RETPOLINE_SAFE
> >> +	jmp *%rbx
> >> +
> >> +	int $3
> >> +
> >> +SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL)
> >> +SYM_CODE_END(hv_crash_asm64)
> >> --
> >> 2.36.1.vfs.0.0
> >>
> >