Message-ID: <aKYBeIokyVC8AKHe@google.com>
Date: Wed, 20 Aug 2025 10:10:16 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Amit Shah <amit@...nel.org>
Cc: linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org, 
	linux-doc@...r.kernel.org, amit.shah@....com, thomas.lendacky@....com, 
	bp@...en8.de, tglx@...utronix.de, peterz@...radead.org, jpoimboe@...nel.org, 
	pawan.kumar.gupta@...ux.intel.com, corbet@....net, mingo@...hat.com, 
	dave.hansen@...ux.intel.com, hpa@...or.com, pbonzini@...hat.com, 
	daniel.sneddon@...ux.intel.com, kai.huang@...el.com, sandipan.das@....com, 
	boris.ostrovsky@...cle.com, Babu.Moger@....com, david.kaplan@....com, 
	dwmw@...zon.co.uk, andrew.cooper3@...rix.com
Subject: Re: [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests

On Thu, May 15, 2025, Amit Shah wrote:
> From: Amit Shah <amit.shah@....com>
> 
> AMD CPUs with the Enhanced Return Address Predictor (ERAPS) feature
> (Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after
> VMEXITs.  The feature adds guest/host tags to entries in the RSB (a.k.a.
> RAP).  This helps with speculation protection across the VM boundary,
> and it also preserves host and guest entries in the RSB that can improve
> software performance (which would otherwise be flushed due to the
> FILL_RETURN_BUFFER sequences).  This feature also extends the size of
> the RSB from the older standard (of 32 entries) to a new default
> enumerated in CPUID leaf 0x80000021:EBX bits 23:16 -- which is 64
> entries in Zen5 CPUs.
> 
> In addition to flushing the RSB across VMEXIT boundaries, CPUs with
> this feature also flush the RSB when the CR3 is updated (i.e. whenever
> there's a context switch), to prevent one userspace process poisoning
> the RSB that may affect another process.  The relevance of this for KVM
> is explained below in caveat 2.
> 
> The hardware feature is always-on, and the host context uses the full
> default RSB size without any software changes necessary.  The presence
> of this feature allows software (both in host and guest contexts) to
> drop all RSB filling routines in favour of the hardware doing it.
> 
> For guests to observe and use this feature, 

Guests don't necessarily "use" this feature.  It's something that's enabled by
KVM and affects hardware behavior regardless of whether or not the guest is even
aware ERAPS is a thing.

> the hypervisor needs to expose the CPUID bit, and also set a VMCB bit.
> Without one or both of those, 

No?  If there's no enabling for bare metal usage, I don't see how emulation of
CPUID can possibly impact the RAP size that's used.  The only thing that matters
is the VMCB bit.  And nothing in this patch queries guest CPUID.

Observing ERAPS _might_ cause the guest to forego certain mitigations, but KVM
has zero visibility into whether or not such mitigations exist, if the guest will
care about ERAPS, etc.

> guests continue to use the older default RSB size and behaviour for backwards
> compatibility.  This means the hardware RSB size is limited to 32 entries for
> guests that do not have this feature exposed to them.
> 
> There are two guest/host configurations that need to be addressed before
> allowing a guest to use this feature: nested guests, and hosts using
> shadow paging (or when NPT is disabled):
> 
> 1. Nested guests: the ERAPS feature adds host/guest tagging to entries
>    in the RSB, but does not distinguish between the guest ASIDs.  To
>    prevent the case of an L2 guest poisoning the RSB to attack the L1
>    guest, the CPU exposes a new VMCB bit (FLUSH_RAP_ON_VMRUN).  The next
>    VMRUN with a VMCB that has this bit set causes the CPU to flush the
>    RSB before entering the guest context.  In this patch, we set the bit

"this patch", "we".

>    in VMCB01 after a nested #VMEXIT to ensure the next time the L1 guest
>    runs, its RSB contents aren't polluted by the L2's contents.
>    Similarly, when an exit from L1 to the hypervisor happens, we set
>    that bit for VMCB02, so that the L1 guest's RSB contents are not
>    leaked/used in the L2 context.
> 
> 2. Hosts that disable NPT: the ERAPS feature also flushes the RSB
>    entries when the CR3 is updated.  When using shadow paging, CR3
>    updates within the guest do not update the CPU's CR3 register.

Yes they do, just indirectly.  KVM changes the effective CR3 in reaction to the
guest's new CR3.  If hardware doesn't flush in that situation, then it's trivially
easy to set ERAP_CONTROL_FLUSH_RAP on writes to CR3.
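
Completely untested, and I'm guessing that shadow paging is the only case that
needs it, but e.g. in cr_interception():

	case 3:
		err = kvm_set_cr3(vcpu, val);
		if (!err && !npt_enabled && kvm_cpu_cap_has(X86_FEATURE_ERAPS))
			svm->vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
		break;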

>    In this case, do not expose the ERAPS feature to guests,
>    and the guests continue with existing mitigations to fill the RSB.
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 571c906ffcbf..0cca1865826e 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1187,6 +1187,9 @@ void kvm_set_cpu_caps(void)
>  		F(SRSO_USER_KERNEL_NO),
>  	);
>  
> +	if (tdp_enabled)
> +		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);

_If_ ERAPS is conditionally enabled, then it probably makes sense to do this in
svm_set_cpu_caps().  But I think we can just support ERAPS unconditionally.
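
E.g. (sketch only, assuming the NPT dependency is actually warranted):

	/* Per the patch, ERAPS relies on hardware CR3 writes, i.e. needs NPT. */
	if (npt_enabled)
		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);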

> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index a89c271a1951..a2b075ed4133 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1363,6 +1363,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>  	if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
>  		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
>  
> +	if (boot_cpu_has(X86_FEATURE_ERAPS) && npt_enabled)
	
Don't regurgitate the same check in multiple places.

	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))


	
> +		vmcb_enable_extended_rap(svm->vmcb);
> +
>  	if (kvm_vcpu_apicv_active(vcpu))
>  		avic_init_vmcb(svm, vmcb);
>  
> @@ -3482,6 +3485,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
>  	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
>  	pr_err("%-20s%d\n", "asid:", control->asid);
>  	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
> +	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
>  	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
>  	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
>  	pr_err("%-20s%08x\n", "int_state:", control->int_state);
> @@ -3663,6 +3667,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>  
>  		trace_kvm_nested_vmexit(vcpu, KVM_ISA_SVM);
>  
> +		if (vmcb_is_extended_rap(svm->vmcb01.ptr)) {
> +			vmcb_set_flush_guest_rap(svm->vmcb01.ptr);
> +			vmcb_clr_flush_guest_rap(svm->nested.vmcb02.ptr);
> +		}
> +
>  		vmexit = nested_svm_exit_special(svm);
>  
>  		if (vmexit == NESTED_EXIT_CONTINUE)
> @@ -3670,6 +3679,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>  
>  		if (vmexit == NESTED_EXIT_DONE)
>  			return 1;
> +	} else {
> +		if (vmcb_is_extended_rap(svm->vmcb01.ptr) && svm->nested.initialized) {
> +			vmcb_set_flush_guest_rap(svm->nested.vmcb02.ptr);
> +			vmcb_clr_flush_guest_rap(svm->vmcb01.ptr);
> +		}

Handling this in the common exit path is confusing, inefficient, and lacking.

Assuming hardware doesn't automatically clear ERAP_CONTROL_FLUSH_RAP, KVM should
clear the flag after _any_ exit, not just exits that reach this point, e.g. even
when KVM stays in the fast path.

And IIUC, ERAP_CONTROL_FLUSH_RAP needs to be set on _every_ nested transition,
not just those that occur in direct response to a hardware #VMEXIT.  So, hook
nested_vmcb02_prepare_control() for nested VMRUN and nested_svm_vmexit() for
nested #VMEXIT.

Side topic, the changelog should call out that KVM deliberately ignores guest
CPUID, and instead unconditionally enables the full size RAP when ERAPS is
supported.  I.e. KVM _could_ check guest_cpu_cap_has() instead of kvm_cpu_cap_has()
in all locations, to avoid having to flush the RAP on nested transitions when
ERAPS isn't enumerated to the guest, but presumably using the full size RAP is
better for overall performance.
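
I.e. the alternative (which I don't think we want) would be something like this
in nested_vmcb02_prepare_control():

	if (guest_cpu_cap_has(vcpu, X86_FEATURE_ERAPS))
		vmcb02->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;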

The changelog should also call out that if the full size RAP is enabled, then
it's KVM's responsibility to flush the RAP on nested transitions irrespective
of whether or not ERAPS is advertised to the guest.  Because if ERAPS isn't
advertised, then the guest's mitigations will likely be insufficient.

>  	}
>  
>  	if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index f16b068c4228..7f44f7c9b1d5 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -493,6 +493,26 @@ static inline bool svm_is_intercept(struct vcpu_svm *svm, int bit)
>  	return vmcb_is_intercept(&svm->vmcb->control, bit);
>  }
>  
> +static inline void vmcb_set_flush_guest_rap(struct vmcb *vmcb)
> +{
> +	vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
> +}
> +
> +static inline void vmcb_clr_flush_guest_rap(struct vmcb *vmcb)
> +{
> +	vmcb->control.erap_ctl &= ~ERAP_CONTROL_FLUSH_RAP;
> +}
> +
> +static inline void vmcb_enable_extended_rap(struct vmcb *vmcb)
> +{
> +	vmcb->control.erap_ctl |= ERAP_CONTROL_ALLOW_LARGER_RAP;
> +}
> +
> +static inline bool vmcb_is_extended_rap(struct vmcb *vmcb)
> +{
> +	return !!(vmcb->control.erap_ctl & ERAP_CONTROL_ALLOW_LARGER_RAP);
> +}

Eh, just drop all of these wrappers.

With the caveat that I'm taking a wild guess on the !npt behavior, something
like this?

---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/include/asm/svm.h         | 6 +++++-
 arch/x86/kvm/cpuid.c               | 7 ++++++-
 arch/x86/kvm/svm/nested.c          | 6 ++++++
 arch/x86/kvm/svm/svm.c             | 9 +++++++++
 5 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index eb859299d514..87d9284c166a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -466,6 +466,7 @@
 #define X86_FEATURE_GP_ON_USER_CPUID	(20*32+17) /* User CPUID faulting */
 
 #define X86_FEATURE_PREFETCHI		(20*32+20) /* Prefetch Data/Instruction to Cache Level */
+#define X86_FEATURE_ERAPS		(20*32+24) /* Enhanced Return Address Predictor Security */
 #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
 #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
 #define X86_FEATURE_SRSO_NO		(20*32+29) /* CPU is not affected by SRSO */
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index ffc27f676243..58a079d6c3ef 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -131,7 +131,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 	u64 tsc_offset;
 	u32 asid;
 	u8 tlb_ctl;
-	u8 reserved_2[3];
+	u8 erap_ctl;
+	u8 reserved_2[2];
 	u32 int_ctl;
 	u32 int_vector;
 	u32 int_state;
@@ -182,6 +183,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define TLB_CONTROL_FLUSH_ASID 3
 #define TLB_CONTROL_FLUSH_ASID_LOCAL 7
 
+#define ERAP_CONTROL_FULL_SIZE_RAP BIT(0)
+#define ERAP_CONTROL_FLUSH_RAP BIT(1)
+
 #define V_TPR_MASK 0x0f
 
 #define V_IRQ_SHIFT 8
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ad6cadf09930..184a810b53e3 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1177,6 +1177,7 @@ void kvm_set_cpu_caps(void)
 		F(AUTOIBRS),
 		F(PREFETCHI),
 		EMULATED_F(NO_SMM_CTL_MSR),
+		F(ERAPS),
 		/* PrefetchCtlMsr */
 		/* GpOnUserCpuid */
 		/* EPSF */
@@ -1760,9 +1761,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
 		break;
 	case 0x80000021:
-		entry->ebx = entry->edx = 0;
+		entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
 		cpuid_entry_override(entry, CPUID_8000_0021_ECX);
+		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+			entry->ebx &= GENMASK(23, 16);
+		else
+			entry->ebx = 0;
 		break;
 	/* AMD Extended Performance Monitoring and Debug */
 	case 0x80000022: {
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index b7fd2e869998..77794fd809e1 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -861,6 +861,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
 		}
 	}
 
+	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+		vmcb02->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
+
 	/*
 	 * Merge guest and host intercepts - must be called with vcpu in
 	 * guest-mode to take effect.
@@ -1144,6 +1147,9 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 
 	kvm_nested_vmexit_handle_ibrs(vcpu);
 
+	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+		vmcb01->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
+
 	svm_switch_vmcb(svm, &svm->vmcb01);
 
 	/*
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7e7821ee8ee1..501596e56d39 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1205,6 +1205,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 		svm_clr_intercept(svm, INTERCEPT_PAUSE);
 	}
 
+	if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+		svm->vmcb->control.erap_ctl |= ERAP_CONTROL_FULL_SIZE_RAP;
+
 	if (kvm_vcpu_apicv_active(vcpu))
 		avic_init_vmcb(svm, vmcb);
 
@@ -2560,6 +2563,8 @@ static int cr_interception(struct kvm_vcpu *vcpu)
 			break;
 		case 3:
 			err = kvm_set_cr3(vcpu, val);
+			if (!err && nested && kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+				svm->vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP;
 			break;
 		case 4:
 			err = kvm_set_cr4(vcpu, val);
@@ -3322,6 +3327,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
 	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
 	pr_err("%-20s%d\n", "asid:", control->asid);
 	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
+	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
 	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
 	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
 	pr_err("%-20s%08x\n", "int_state:", control->int_state);
@@ -4366,6 +4372,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
 	}
 
 	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+	if (cpu_feature_enabled(X86_FEATURE_ERAPS))
+		svm->vmcb->control.erap_ctl &= ~ERAP_CONTROL_FLUSH_RAP;
+
 	vmcb_mark_all_clean(svm->vmcb);
 
 	/* if exit due to PF check for async PF */

base-commit: 91b392ada892a2e8b1c621b9493c50f6fb49880f
--
