Message-ID: <1301b802284ed5755fe397f54e1de41638aec49c.camel@intel.com>
Date: Wed, 10 Sep 2025 08:02:46 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "linux-pm@...r.kernel.org"
<linux-pm@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "xin@...or.com" <xin@...or.com>
CC: "brgerst@...il.com" <brgerst@...il.com>, "andrew.cooper3@...rix.com"
<andrew.cooper3@...rix.com>, "arjan@...ux.intel.com" <arjan@...ux.intel.com>,
"Williams, Dan J" <dan.j.williams@...el.com>, "dave.hansen@...ux.intel.com"
<dave.hansen@...ux.intel.com>, "Edgecombe, Rick P"
<rick.p.edgecombe@...el.com>, "seanjc@...gle.com" <seanjc@...gle.com>,
"mingo@...hat.com" <mingo@...hat.com>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, "tglx@...utronix.de" <tglx@...utronix.de>,
"hpa@...or.com" <hpa@...or.com>, "peterz@...radead.org"
<peterz@...radead.org>, "pavel@...nel.org" <pavel@...nel.org>, "bp@...en8.de"
<bp@...en8.de>, "kprateek.nayak@....com" <kprateek.nayak@....com>,
"rafael@...nel.org" <rafael@...nel.org>, "david.kaplan@....com"
<david.kaplan@....com>, "x86@...nel.org" <x86@...nel.org>, "Gao, Chao"
<chao.gao@...el.com>
Subject: Re: [RFC PATCH v1 1/5] x86/boot: Shift VMXON from KVM init to CPU
startup phase
On Tue, 2025-09-09 at 11:28 -0700, Xin Li (Intel) wrote:
> Move the VMXON setup from the KVM initialization path to the CPU startup
> phase to guarantee that hardware virtualization is enabled early and
> without interruption.
>
> As a result, KVM, often loaded as a kernel module, no longer needs to worry
> about whether or not VMXON has been executed on a CPU (e.g., CPU offline
> events or system reboots while KVM is loading).
KVM has a module parameter 'enable_virt_at_load', which controls whether
to enable virtualization (in the case of VMX, do VMXON) when loading KVM, or
to defer the enabling until the first VM is created.
Changing to unconditionally do VMXON when bringing up a CPU will
effectively break this. Maybe we will eventually switch to unconditional
VMXON, but for now that seems like a dramatic move.
I was thinking the code change would be for the core kernel to only provide
VMXON/VMXOFF APIs for KVM (and other kernel components) to use, i.e., more
like "moving" the VMX code out of KVM.
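To make the idea concrete, a rough sketch of what such core-kernel APIs
could look like (all names here -- cpu_vmx_get()/cpu_vmx_put() and the
__cpu_vmxon()/__cpu_vmxoff() helpers -- are hypothetical, just for
illustration; a refcount lets multiple in-kernel users share VMXON):

```c
/*
 * Hypothetical sketch only: none of these names exist in the tree today.
 * Callers run on the target CPU with preemption disabled.
 */
static DEFINE_PER_CPU(unsigned int, vmx_enable_count);

int cpu_vmx_get(void)
{
	int ret;

	if (!this_cpu_read(vmx_enable_count)) {
		/* Set CR4.VMXE and execute VMXON; may fail. */
		ret = __cpu_vmxon();
		if (ret)
			return ret;
	}
	this_cpu_inc(vmx_enable_count);
	return 0;
}

void cpu_vmx_put(void)
{
	if (this_cpu_dec_return(vmx_enable_count))
		return;
	/* Last user on this CPU: VMXOFF and clear CR4.VMXE. */
	__cpu_vmxoff();
}
```

KVM's enable_virt_at_load semantics would then be preserved: KVM simply
decides *when* to call these, at module load or at first-VM creation.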
[...]
> +static bool is_vmx_supported(void)
> +{
> + int cpu = raw_smp_processor_id();
> +
> + if (!(cpuid_ecx(1) & (1 << (X86_FEATURE_VMX & 31)))) {
> + /* May not be an Intel CPU */
> + pr_info("VMX not supported by CPU%d\n", cpu);
> + return false;
> + }
> +
> + if (!this_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
> + !this_cpu_has(X86_FEATURE_VMX)) {
> + pr_err("VMX not enabled (by BIOS) in MSR_IA32_FEAT_CTL on CPU%d\n", cpu);
> + return false;
> + }
> +
> + return true;
> +}
> +
> +/* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. */
> +union vmxon_vmcs {
> + struct vmcs_hdr hdr;
> + char data[PAGE_SIZE];
> +};
> +
> +static DEFINE_PER_CPU_PAGE_ALIGNED(union vmxon_vmcs, vmxon_vmcs);
> +
> +/*
> + * Executed during the CPU startup phase to execute VMXON to enable VMX. This
> + * ensures that KVM, often loaded as a kernel module, no longer needs to worry
> + * about whether or not VMXON has been executed on a CPU (e.g., CPU offline
> + * events or system reboots while KVM is loading).
> + *
> + * VMXON is not expected to fault, but fault handling is kept as a precaution
> + * against any unexpected code paths that might trigger it and can be removed
> + * later if unnecessary.
> + */
> +void cpu_enable_virtualization(void)
> +{
> + u64 vmxon_pointer = __pa(this_cpu_ptr(&vmxon_vmcs));
> + int cpu = raw_smp_processor_id();
> + u64 basic_msr;
> +
> + if (!is_vmx_supported())
> + return;
> +
> + if (cr4_read_shadow() & X86_CR4_VMXE) {
> + pr_err("VMX already enabled on CPU%d\n", cpu);
> + return;
> + }
> +
> + memset(this_cpu_ptr(&vmxon_vmcs), 0, PAGE_SIZE);
> +
> + /*
> + * Even though not explicitly documented by TLFS, VMXArea passed as
> + * VMXON argument should still be marked with revision_id reported by
> + * physical CPU.
> + */
> + rdmsrq(MSR_IA32_VMX_BASIC, basic_msr);
> + this_cpu_ptr(&vmxon_vmcs)->hdr.revision_id = vmx_basic_vmcs_revision_id(basic_msr);
> +
> + intel_pt_handle_vmx(1);
> +
> + cr4_set_bits(X86_CR4_VMXE);
> +
> + asm goto("1: vmxon %[vmxon_pointer]\n\t"
> + _ASM_EXTABLE(1b, %l[fault])
> + : : [vmxon_pointer] "m"(vmxon_pointer)
> + : : fault);
> +
> + return;
> +
> +fault:
> + pr_err("VMXON faulted on CPU%d\n", cpu);
> + cr4_clear_bits(X86_CR4_VMXE);
> + intel_pt_handle_vmx(0);
> +}
> +
> /*
> * This does the hard work of actually picking apart the CPU stuff...
> */
> @@ -2120,6 +2199,12 @@ void identify_secondary_cpu(unsigned int cpu)
>
> tsx_ap_init();
> c->initialized = true;
> +
> + /*
> + * Enable AP virtualization immediately after initializing the per-CPU
> + * cpuinfo_x86 structure, ensuring that this_cpu_has() operates correctly.
> + */
> + cpu_enable_virtualization();
> }
AFAICT there's a functional drawback here: this implementation doesn't
handle VMXON failure, while the existing KVM code does via a CPUHP
callback.
>
> void print_cpu_info(struct cpuinfo_x86 *c)
> @@ -2551,6 +2636,12 @@ void __init arch_cpu_finalize_init(void)
> *c = boot_cpu_data;
> c->initialized = true;
>
> + /*
> + * Enable BSP virtualization right after the BSP cpuinfo_x86 structure
> + * is initialized to ensure this_cpu_has() works as expected.
> + */
> + cpu_enable_virtualization();
> +
>
Any reason you chose to do it in arch_cpu_finalize_init()? Perhaps
just an arch_initcall() or similar?
KVM has a specific CPUHP_AP_KVM_ONLINE state to handle VMXON/VMXOFF for CPU
online/offline. And it's not in the STARTUP section (which is not allowed
to fail), so it can handle VMXON failure.
How about adding a VMX specific CPUHP callback instead?
That way, not only can we put all VMX-related code together (e.g., in
arch/x86/virt/vmx/vmx.c), which is much easier to review/maintain, but we
can also still handle VMXON failure just like KVM does.
(BTW, originally KVM's CPUHP callback was also in the STARTUP section, but
we later moved it to after that in order to handle VMXON failure and
compatibility-check failure gracefully.)
[...]
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index 916441f5e85c..0eec314b79c2 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -206,11 +206,11 @@ static void notrace __restore_processor_state(struct saved_context *ctxt)
> /* cr4 was introduced in the Pentium CPU */
> #ifdef CONFIG_X86_32
> if (ctxt->cr4)
> - __write_cr4(ctxt->cr4);
> + __write_cr4(ctxt->cr4 & ~X86_CR4_VMXE);
> #else
> /* CONFIG X86_64 */
> wrmsrq(MSR_EFER, ctxt->efer);
> - __write_cr4(ctxt->cr4);
> + __write_cr4(ctxt->cr4 & ~X86_CR4_VMXE);
> #endif
> write_cr3(ctxt->cr3);
> write_cr2(ctxt->cr2);
> @@ -291,6 +291,9 @@ static void notrace __restore_processor_state(struct saved_context *ctxt)
> * because some of the MSRs are "emulated" in microcode.
> */
> msr_restore_context(ctxt);
> +
> + if (ctxt->cr4 & X86_CR4_VMXE)
> + cpu_enable_virtualization();
> }
>
If we still leverage what KVM is doing -- using a syscore_ops callback -- I
think we can avoid changing this function and keep all the VMX code in a
dedicated file.