[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYO3AaBqbPy_9XdD@google.com>
Date: Wed, 4 Feb 2026 13:15:45 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/3] KVM: nSVM: Do not track EFER.SVME toggling in guest mode
The shortlog is *very* misleading. The changelog isn't much better. This isn't
just removing "tracking", it's redefining guest visible behavior and effectively
changing the KVM virtual CPU microarchitecture.
On Fri, Jan 30, 2026, Yosry Ahmed wrote:
> KVM tracks when EFER.SVME is set and cleared to initialize and tear down
> nested state. However, it doesn't differentiate if EFER.SVME is getting
> toggled in L1 or L2+. Toggling EFER.SVME in L2+ is inconsequential from
> KVM's perspective, as the vCPU is still obviously using nested.
>
> This causes a problem if L2 sets then clears EFER.SVME without L1
> interception, as KVM exits guest mode and tears down nested state while
> L2 is running, executing L1 without injecting a proper #VMEXIT.
>
> Technically, it's not a bug as the APM states that an L1 hypervisor
> should intercept EFER writes:
>
> The effect of turning off EFER.SVME while a guest is running is
> undefined; therefore, the VMM should always prevent guests from
> writing EFER.
>
> However, it would be nice if KVM handled it more gracefully.
That's not sufficient justification for this change. Nothing will ever trip this
code unless it's _trying_ to trip this code. I.e. Outside of a bespoke selftest
that is little more than "make sure the kernel doesn't explode", and future
malicious actors, KVM's behavior is largely irrelevant.
> Signed-off-by: Yosry Ahmed <yosry.ahmed@...ux.dev>
> ---
> arch/x86/kvm/svm/svm.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 4575a6a7d6c4e..eaf0f8053fbfb 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -208,6 +208,13 @@ static int svm_set_efer_svme(struct kvm_vcpu *vcpu, u64 old_efer, u64 new_efer)
> if ((old_efer & EFER_SVME) == (new_efer & EFER_SVME))
> return 0;
>
> + /*
> + * An L2 guest setting or clearing EFER_SVME does not change whether or
> + * not the vCPU can use nested from KVM's perspective.
This should call out that the architectural behavior is undefined. "from KVM's
perspective" is an obtuse way of saying "KVM is making up behavior because it
can". E.g. something like
/*
* Architecturally, clearing EFER.SVME while a guest is running yields
* undefined behavior, i.e. KVM has carte blance to do anything if L1
* doesn't intercept writes to EFER. Simply do nothing, because XYZ.
*/
> + */
> + if (is_guest_mode(vcpu))
This is fine, because svm_allocate_nested() plays nice with redundant calls, but
this is a rather scary change for something that straight up doesn't happen in
practice. Any hypervisor that doesn't intercept EFER is broken, period.
E.g. if a future change adds novel code that's guarded by the
if ((old_efer & EFER_SVME) == (new_efer & EFER_SVME))
return 0;
check, then doing nothing here could result in a guest-exploitable bug. In other
words, from a kernel safety perspective, "doing nothing" is more dangerous than
forcibly leaving nested mode, because doing nothing deliberately puts KVM in an
inconsistent state. Given that there's basically zero benefit in practice, I'm
strongly inclined to keep the call svm_leave_nested().
All that said, I agree that pulling the rug out from under the VM is a terrible
experience. What if we throw a triple fault at the vCPU so that L1 gets an
immediate SHUTDOWN (not a VM-Exit, a SHUTDOWN of the L1 vCPU), instead of running
random garbage from L2?
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5f0136dbdde6..ccd73a3be3f9 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -216,6 +216,17 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) {
if (!(efer & EFER_SVME)) {
+ /*
+ * Architecturally, clearing EFER.SVME while a guest is
+ * running yields undefined behavior, i.e. KVM can do
+ * literally anything. Force the vCPU back into L1 as
+ * that is the safest option for KVM, but synthesize a
+ * triple fault (for L1!) so that KVM at least doesn't
+ * run random L2 code in the context of L1.
+ */
+ if (is_guest_mode(vcpu))
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+
svm_leave_nested(vcpu);
/* #GP intercept is still needed for vmware backdoor */
if (!enable_vmware_backdoor)
> + return 0;
> +
> if (new_efer & EFER_SVME) {
> r = svm_allocate_nested(svm);
> if (r)
> --
> 2.53.0.rc1.225.gd81095ad13-goog
>
Powered by blists - more mailing lists