[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYI4d0zPw3K5BedW@google.com>
Date: Tue, 3 Feb 2026 10:03:35 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on
nested #VMEXIT
On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote:
> > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > /*
> > * If L2 is active, defer delivery of the payload until the
> > * exception is actually injected to avoid clobbering state if
> > * L1 wants to intercept the exception (the architectural state
> > * is NOT updated if the exeption is morphed to a VM-Exit).
> > */
>
> It's not only about exceptions being morphed to a VM-Exit though, is it?
> KVM should not update the payload (e.g. CR2) if a #PF is pending but was
> not injected, because from L1's perspective CR2 was updated but
> exit_int_info won't reflect a #PF. Right?
Right, but that's got nothing to do with L2 being active. Take nested completely
out of the picture, and the above statement holds true as well. "If a #PF is
pending but was not injected, then the guest shouldn't see a change in CR2".
> > But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending
> > VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a
> > VM-Exit.
> >
> > Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that
> > commit. Because invoking kvm_deliver_exception_payload() when the exception was
> > morphed to a VM-Exit is wrong. Oh, wait, this is the !exception_payload_enabled
> > case. So never mind, that's simply an unfixable bug, as the second comment alludes
> > to.
>
> Hmm for the #PF case I think delivering the payload is always wrong if
> it was morphed to a VM-Exit, regardless of exception_payload_enabled,
> because the payload should never reach CR2, right?
Right.
> Spoiler alert, there's another problem there. Even if the exception did
> not morph to a VM-Exit, if userspace already did KVM_GET_SREGS then the
> delivered payload is lost :/
>
> >
> > /*
> > * KVM's ABI only allows for one exception to be migrated. Luckily,
> > * the only time there can be two queued exceptions is if there's a
> > * non-exiting _injected_ exception, and a pending exiting exception.
> > * In that case, ignore the VM-Exiting exception as it's an extension
> > * of the injected exception.
> > */
> > if (vcpu->arch.exception_vmexit.pending &&
> > !vcpu->arch.exception.pending &&
> > !vcpu->arch.exception.injected)
> > ex = &vcpu->arch.exception_vmexit;
> > else
> > ex = &vcpu->arch.exception;
> >
> > /*
> > * In guest mode, payload delivery should be deferred if the exception
> > * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
> > * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability,
> > * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> > * propagate the payload and so it cannot be safely deferred. Deliver
> > * the payload if the capability hasn't been requested.
> > */
> > if (!vcpu->kvm->arch.exception_payload_enabled &&
> > ex->pending && ex->has_payload)
> > kvm_deliver_exception_payload(vcpu, ex);
> >
> > So yeah, I _think_ we could drop the is_guest_mode() check. However, even better
> > would be to drop this call *entirely*, i.e.
>
> Hmm I don't think so, because as I mentioned above, KVM shouldn't update
> CR2 until the exception is actually injected, right?
Me confused. What you're saying is what I'm suggesting: don't update CR2 until
KVM actually stuffs the #PF into the VMCS/VMCB. Oh, or are you talking about the
first sentence? If so, strike that from the record, I was wrong (see below).
> It would actually be great to drop the is_guest_mode() check here but
> leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS
> and KVM_SET_SREGS goes away, and I *think* we can drop the
> kvm_deliver_exception_payload() call in
> kvm_vcpu_ioctl_x86_get_vcpu_events().
>
> The only problem would be CR2 getting updated without a fault being
> reflected in the vmcb12's exit_int_info AFAICT.
No, that particular case is a non-issue, because the code immediately above has
already verified that KVM will *not* morph the #PF to a nested VM-Exit. Note,
the "queue:" path is just for non-contributory exceptions and doesn't change the
VM-Exit change anyways.
/*
* If the exception is destined for L2, morph it to a VM-Exit if L1
* wants to intercept the exception.
*/
if (is_guest_mode(vcpu) &&
kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) { <====
kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
has_payload, payload);
return;
}
if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
queue:
vcpu->arch.exception.pending = true;
vcpu->arch.exception.injected = false;
vcpu->arch.exception.has_error_code = has_error;
vcpu->arch.exception.vector = nr;
vcpu->arch.exception.error_code = error_code;
vcpu->arch.exception.has_payload = has_payload;
vcpu->arch.exception.payload = payload;
if (!is_guest_mode(vcpu))
kvm_deliver_exception_payload(vcpu,
&vcpu->arch.exception);
return;
}
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index b0112c515584..00a39c95631d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
> > vcpu->arch.exception.error_code = error_code;
> > vcpu->arch.exception.has_payload = has_payload;
> > vcpu->arch.exception.payload = payload;
> > - if (!is_guest_mode(vcpu))
> > - kvm_deliver_exception_payload(vcpu,
> > - &vcpu->arch.exception);
> > return;
> > }
> >
> > Because KVM really shouldn't update CR2 until the excpetion is actually injected
> > (or the state is at risk of being lost because exception_payload_enabled==false).
> > E.g. in the (extremely) unlikely (and probably impossible to configure reliably)
> > scenario that userspace deliberately drops a pending exception, arch state shouldn't
> > be updated.
>
> I think if we drop it there might be a problem. With
> exception_payload_enabled==false, pending exceptions becomes injected
> after save/restore. If the payload is not delivered here, then after
> restore we have an injected event with no payload.
>
> I guess the kvm_deliver_exception_payload() call in
> kvm_vcpu_ioctl_x86_get_vcpu_events() is supposed to handle this, but it
> only works if userspace does KVM_GET_SREGS *after* KVM_GET_VCPU_EVENTS.
> Removing the call here will regress VMM's doing KVM_GET_SREGS AFAICT.
Drat, QEMU does KVM_GET_VCPU_EVENTS before KVM_GET_SREGS{,2}, so I was hopeful
we wouldn't actually break anyone, but Firecracker at least gets sregs before
events. Of course Firecracker is already broken due to not enabling
KVM_CAP_EXCEPTION_PAYLOAD...
Ugh, and it's not just KVM_GET_SREGS, it's also KVM_GET_DEBUGREGS thanks to DR6.
If we're being _super_ pedantic, then delivering the payload anywhere but injection
or KVM_GET_VCPU_EVENTS is "wrong" (well, "more wrong", since any behavior without
KVM_CAP_EXCEPTION_PAYLOAD is inherently wrong). E.g. very hypothetically, userspace
that saves/restores sregs but not vCPU events would see an unexpected CR2 change.
*sigh*
Ewwwwwwww. And we *definitely* don't want to drop the is_guest_mode() check,
because nested_vmx_update_pending_dbg() relies on the payload to still be valid.
Ah, right, and that's also why commit a06230b62b89 ("KVM: x86: Deliver exception
payload on KVM_GET_VCPU_EVENTS") from MTF series deferred the update: KVM needs
to keep the #DB pending so that a VM-Exit that occurs before the #DB is injected
gets recorded in vmcs12.
So, with all of that in mind, I believe the best we can do is fully defer delivery
of the exception until it's actually injected, and then apply the quirk to the
relevant GET APIs.
---
arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++-----------------
1 file changed, 39 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0112c515584..e000521dfc8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
vcpu->arch.exception.error_code = error_code;
vcpu->arch.exception.has_payload = has_payload;
vcpu->arch.exception.payload = payload;
- if (!is_guest_mode(vcpu))
- kvm_deliver_exception_payload(vcpu,
- &vcpu->arch.exception);
return;
}
@@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
return 0;
}
-static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
- struct kvm_vcpu_events *events)
+static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu)
{
- struct kvm_queued_exception *ex;
-
- process_nmi(vcpu);
-
-#ifdef CONFIG_KVM_SMM
- if (kvm_check_request(KVM_REQ_SMI, vcpu))
- process_smi(vcpu);
-#endif
-
/*
* KVM's ABI only allows for one exception to be migrated. Luckily,
* the only time there can be two queued exceptions is if there's a
@@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
if (vcpu->arch.exception_vmexit.pending &&
!vcpu->arch.exception.pending &&
!vcpu->arch.exception.injected)
- ex = &vcpu->arch.exception_vmexit;
- else
- ex = &vcpu->arch.exception;
+ return &vcpu->arch.exception_vmexit;
+
+ return &vcpu->arch.exception;
+}
+
+static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu)
+{
+ struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
/*
- * In guest mode, payload delivery should be deferred if the exception
- * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
- * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability,
- * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
- * propagate the payload and so it cannot be safely deferred. Deliver
- * the payload if the capability hasn't been requested.
+ * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver
+ * the pending exception payload when userspace saves *any* vCPU state
+ * that interacts with exception payloads to avoid breaking userspace.
+ *
+ * Architecturally, KVM must not deliver an exception payload until the
+ * exception is actually injected, e.g. to avoid losing pending #DB
+ * information (which VMX tracks in the VMCS), and to avoid clobbering
+ * state if the exception is never injected for whatever reason. But
+ * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or
+ * may not propagate the payload across save+restore, and so KVM can't
+ * safely defer delivery of the payload.
*/
if (!vcpu->kvm->arch.exception_payload_enabled &&
ex->pending && ex->has_payload)
kvm_deliver_exception_payload(vcpu, ex);
+}
+
+static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
+ struct kvm_vcpu_events *events)
+{
+ struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
+
+ process_nmi(vcpu);
+
+#ifdef CONFIG_KVM_SMM
+ if (kvm_check_request(KVM_REQ_SMI, vcpu))
+ process_smi(vcpu);
+#endif
+
+ kvm_handle_exception_payload_quirk(vcpu);
memset(events, 0, sizeof(*events));
@@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
vcpu->arch.guest_state_protected)
return -EINVAL;
+ kvm_handle_exception_payload_quirk(vcpu);
+
memset(dbgregs, 0, sizeof(*dbgregs));
BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db));
@@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
if (vcpu->arch.guest_state_protected)
goto skip_protected_regs;
+ kvm_handle_exception_payload_quirk(vcpu);
+
kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES);
base-commit: 55671237401edd1ec59276b852b9361cc170915b
--
Powered by blists - more mailing lists