linux-kernel - Re: [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aK7NIk1ArgQaDPHp@mun-amitshah-l>
Date: Wed, 27 Aug 2025 11:17:22 +0200
From: Amit Shah <amit@...nel.org>
To: Sean Christopherson <seanjc@...gle.com>
Cc: linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org,
	linux-doc@...r.kernel.org, amit.shah@....com,
	thomas.lendacky@....com, bp@...en8.de, tglx@...utronix.de,
	peterz@...radead.org, jpoimboe@...nel.org,
	pawan.kumar.gupta@...ux.intel.com, corbet@....net, mingo@...hat.com,
	dave.hansen@...ux.intel.com, hpa@...or.com, pbonzini@...hat.com,
	daniel.sneddon@...ux.intel.com, kai.huang@...el.com,
	sandipan.das@....com, boris.ostrovsky@...cle.com,
	Babu.Moger@....com, david.kaplan@....com, dwmw@...zon.co.uk,
	andrew.cooper3@...rix.com
Subject: Re: [PATCH v5 1/1] x86: kvm: svm: set up ERAPS support for guests

On (Wed) 20 Aug 2025 [10:10:16], Sean Christopherson wrote:
> On Thu, May 15, 2025, Amit Shah wrote:

[...]

> > For guests to observe and use this feature, 
> 
> Guests don't necessarily "use" this feature.  It's something that's enabled by
> KVM and affects harware behavior regardless of whether or not the guest is even
> aware ERAPS is a thing.

OK wording it is tricky.  "use" in the sense of for the entire RSB to be
utilized within guest context.  Not "use" as in guest needs enablement or
needs to do anything special.

"For the extended size to also be utilized when the CPU is in guest context,
the hypervisor needs to..." ?

> > the hypervisor needs to expose the CPUID bit, and also set a VMCB bit.
> > Without one or both of those, 
> 
> No?  If there's no enabling for bare metal usage, I don't see how emulation of
> CPUID can possibly impact usage of RAP size.  The only thing that matters is the
> VMCB bit.  And nothing in this patch queries guest CPUID.

True.

> Observing ERAPS _might_ cause the guest to forego certain mitigations, but KVM
> has zero visibility into whether or not such mitigations exist, if the guest will
> care about ERAPS, etc.

Sure, there's nothing guest-specific about this; any OS, when it detects
ERAPS, may or may not want to adapt to its existence.  (As it turns out, for
Linux, no adaptation is necessary.)

> > guests continue to use the older default RSB size and behaviour for backwards
> > compatibility.  This means the hardware RSB size is limited to 32 entries for
> > guests that do not have this feature exposed to them.

[...]

> > 2. Hosts that disable NPT: the ERAPS feature also flushes the RSB
> >    entries when the CR3 is updated.  When using shadow paging, CR3
> >    updates within the guest do not update the CPU's CR3 register.
> 
> Yes they do, just indirectly.  KVM changes the effective CR3 in reaction to the
> guest's new CR3.  If hardware doesn't flush in that situation, then it's trivially
> easy to set ERAP_CONTROL_FLUSH_RAP on writes to CR3.

Yea, that's right - since it doesn't happen in-guest (i.e. there's an exit
instead), it needs KVM to set that bit.

[...]

> > @@ -3482,6 +3485,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
> >  	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
> >  	pr_err("%-20s%d\n", "asid:", control->asid);
> >  	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
> > +	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
> >  	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
> >  	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
> >  	pr_err("%-20s%08x\n", "int_state:", control->int_state);
> > @@ -3663,6 +3667,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> >  
> >  		trace_kvm_nested_vmexit(vcpu, KVM_ISA_SVM);
> >  
> > +		if (vmcb_is_extended_rap(svm->vmcb01.ptr)) {
> > +			vmcb_set_flush_guest_rap(svm->vmcb01.ptr);
> > +			vmcb_clr_flush_guest_rap(svm->nested.vmcb02.ptr);
> > +		}
> > +
> >  		vmexit = nested_svm_exit_special(svm);
> >  
> >  		if (vmexit == NESTED_EXIT_CONTINUE)
> > @@ -3670,6 +3679,11 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> >  
> >  		if (vmexit == NESTED_EXIT_DONE)
> >  			return 1;
> > +	} else {
> > +		if (vmcb_is_extended_rap(svm->vmcb01.ptr) && svm->nested.initialized) {
> > +			vmcb_set_flush_guest_rap(svm->nested.vmcb02.ptr);
> > +			vmcb_clr_flush_guest_rap(svm->vmcb01.ptr);
> > +		}
> 
> Handling this in the common exit path is confusing, inefficient, and lacking.

Heh, I agree.  I toyed with doing this just before VMRUN.  But I can't recall
why I disliked that more.

> Assuming hardware doesn't automatically clear ERAP_CONTROL_FLUSH_RAP, then KVM

That's right - it doesn't.

> should clear the flag after _any_ exit, not just exits that reach this point,
> e.g. if KVM stays in the fast path.

(or just before VMRUN).  Right.

> And IIUC, ERAP_CONTROL_FLUSH_RAP needs to be done on _every_ nested transition,
> not just those that occur in direct response to a hardware #VMEXIT. So, hook
> nested_vmcb02_prepare_control() for nested VMRUN and nested_svm_vmexit() for
> nested #VMEXIT.

Does sound better.  I think the case I wanted to preserve in this complex
logic was if we have a L2->exit->L2 transition, I didn't want to set the FLUSH
bit.

> Side topic, the changelog should call out that KVM deliberately ignores guest
> CPUID, and instead unconditionally enables the full size RAP when ERAPS is
> supported.  I.e. KVM _could_ check guest_cpu_cap_has() instead of kvm_cpu_cap_has()
> in all locations, to avoid having to flush the RAP on nested transitions when
> ERAPS isn't enumerated to the guest, but presumably using the full size RAP is
> better for overall performance.

Yea.

> The changelog should also call out that if the full size RAP is enabled, then
> it's KVM's responsibility to flush the RAP on nested transitions irrespective
> of whether or not ERAPS is advertised to the guest.  Because if ERAPS isn't
> advertised, the the guest's mitigations will likely be insufficient.

You mean the L2 guest?  ACK on the update.

> With the caveat that I'm taking a wild guess on the !npt behavior, something
> like this?

[...]

> +#define ERAP_CONTROL_FULL_SIZE_RAP BIT(0)
> +#define ERAP_CONTROL_FLUSH_RAP BIT(1)

Oh I def prefer to keep the APM-specified names.

[...]

Patch looks good!

I'll test it a bit and repost.

Thanks,

		Amit