linux-kernel - Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYI4d0zPw3K5BedW@google.com>
Date: Tue, 3 Feb 2026 10:03:35 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on
 nested #VMEXIT

On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote:
> > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > 		/*
> > 		 * If L2 is active, defer delivery of the payload until the
> > 		 * exception is actually injected to avoid clobbering state if
> > 		 * L1 wants to intercept the exception (the architectural state
> > 		 * is NOT updated if the exeption is morphed to a VM-Exit).
> > 		 */
> 
> It's not only about exceptions being morphed to a VM-Exit though, is it?
> KVM should not update the payload (e.g. CR2) if a #PF is pending but was
> not injected, because from L1's perspective CR2 was updated but
> exit_int_info won't reflect a #PF. Right?

Right, but that's got nothing to do with L2 being active.  Take nested completely
out of the picture, and the above statement holds true as well.  "If a #PF is
pending but was not injected, then the guest shouldn't see a change in CR2".

> > But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending
> > VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a
> > VM-Exit.
> > 
> > Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that
> > commit.  Because invoking kvm_deliver_exception_payload() when the exception was
> > morphed to a VM-Exit is wrong.  Oh, wait, this is the !exception_payload_enabled
> > case.  So never mind, that's simply an unfixable bug, as the second comment alludes
> > to.
> 
> Hmm for the #PF case I think delivering the payload is always wrong if
> it was morphed to a VM-Exit, regardless of exception_payload_enabled,
> because the payload should never reach CR2, right?

Right.

> Spoiler alert, there's another problem there. Even if the exception did
> not morph to a VM-Exit, if userspace already did KVM_GET_SREGS then the
> delivered payload is lost :/
> 
> > 
> > 	/*
> > 	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
> > 	 * the only time there can be two queued exceptions is if there's a
> > 	 * non-exiting _injected_ exception, and a pending exiting exception.
> > 	 * In that case, ignore the VM-Exiting exception as it's an extension
> > 	 * of the injected exception.
> > 	 */
> > 	if (vcpu->arch.exception_vmexit.pending &&
> > 	    !vcpu->arch.exception.pending &&
> > 	    !vcpu->arch.exception.injected)
> > 		ex = &vcpu->arch.exception_vmexit;
> > 	else
> > 		ex = &vcpu->arch.exception;
> > 
> > 	/*
> > 	 * In guest mode, payload delivery should be deferred if the exception
> > 	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
> > 	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
> > 	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> > 	 * propagate the payload and so it cannot be safely deferred.  Deliver
> > 	 * the payload if the capability hasn't been requested.
> > 	 */
> > 	if (!vcpu->kvm->arch.exception_payload_enabled &&
> > 	    ex->pending && ex->has_payload)
> > 		kvm_deliver_exception_payload(vcpu, ex);
> > 
> > So yeah, I _think_ we could drop the is_guest_mode() check.  However, even better
> > would be to drop this call *entirely*, i.e.
> 
> Hmm I don't think so, because as I mentioned above, KVM shouldn't update
> CR2 until the exception is actually injected, right?

Me confused.  What you're saying is what I'm suggesting: don't update CR2 until
KVM actually stuffs the #PF into the VMCS/VMCB.  Oh, or are you talking about the
first sentence?  If so, strike that from the record, I was wrong (see below).

> It would actually be great to drop the is_guest_mode() check here but
> leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS
> and KVM_SET_SREGS goes away, and I *think* we can drop the
> kvm_deliver_exception_payload() call in
> kvm_vcpu_ioctl_x86_get_vcpu_events().
>
> The only problem would be CR2 getting updated without a fault being
> reflected in the vmcb12's exit_int_info AFAICT.

No, that particular case is a non-issue, because the code immediately above has
already verified that KVM will *not* morph the #PF to a nested VM-Exit.  Note,
the "queue:" path is just for non-contributory exceptions and doesn't change the
VM-Exit change anyways.

	/*
	 * If the exception is destined for L2, morph it to a VM-Exit if L1
	 * wants to intercept the exception.
	 */
	if (is_guest_mode(vcpu) &&
	    kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {  <====
		kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
					   has_payload, payload);
		return;
	}

	if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
	queue:
		vcpu->arch.exception.pending = true;
		vcpu->arch.exception.injected = false;

		vcpu->arch.exception.has_error_code = has_error;
		vcpu->arch.exception.vector = nr;
		vcpu->arch.exception.error_code = error_code;
		vcpu->arch.exception.has_payload = has_payload;
		vcpu->arch.exception.payload = payload;
		if (!is_guest_mode(vcpu))
			kvm_deliver_exception_payload(vcpu,
						      &vcpu->arch.exception);
		return;
	}

> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index b0112c515584..00a39c95631d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
> >                 vcpu->arch.exception.error_code = error_code;
> >                 vcpu->arch.exception.has_payload = has_payload;
> >                 vcpu->arch.exception.payload = payload;
> > -               if (!is_guest_mode(vcpu))
> > -                       kvm_deliver_exception_payload(vcpu,
> > -                                                     &vcpu->arch.exception);
> >                 return;
> >         }
> >  
> > Because KVM really shouldn't update CR2 until the excpetion is actually injected
> > (or the state is at risk of being lost because exception_payload_enabled==false).
> > E.g. in the (extremely) unlikely (and probably impossible to configure reliably)
> > scenario that userspace deliberately drops a pending exception, arch state shouldn't
> > be updated.
> 
> I think if we drop it there might be a problem. With
> exception_payload_enabled==false, pending exceptions becomes injected
> after save/restore. If the payload is not delivered here, then after
> restore we have an injected event with no payload.
> 
> I guess the kvm_deliver_exception_payload() call in
> kvm_vcpu_ioctl_x86_get_vcpu_events() is supposed to handle this, but it
> only works if userspace does KVM_GET_SREGS *after* KVM_GET_VCPU_EVENTS.
> Removing the call here will regress VMM's doing KVM_GET_SREGS AFAICT.

Drat, QEMU does KVM_GET_VCPU_EVENTS before KVM_GET_SREGS{,2}, so I was hopeful
we wouldn't actually break anyone, but Firecracker at least gets sregs before
events.  Of course Firecracker is already broken due to not enabling
KVM_CAP_EXCEPTION_PAYLOAD...

Ugh, and it's not just KVM_GET_SREGS, it's also KVM_GET_DEBUGREGS thanks to DR6.

If we're being _super_ pedantic, then delivering the payload anywhere but injection
or KVM_GET_VCPU_EVENTS is "wrong" (well, "more wrong", since any behavior without
KVM_CAP_EXCEPTION_PAYLOAD is inherently wrong).  E.g. very hypothetically, userspace
that saves/restores sregs but not vCPU events would see an unexpected CR2 change.

*sigh*

Ewwwwwwww.  And we *definitely* don't want to drop the is_guest_mode() check,
because nested_vmx_update_pending_dbg() relies on the payload to still be valid.

Ah, right, and that's also why commit a06230b62b89 ("KVM: x86: Deliver exception
payload on KVM_GET_VCPU_EVENTS") from MTF series deferred the update: KVM needs
to keep the #DB pending so that a VM-Exit that occurs before the #DB is injected
gets recorded in vmcs12.

So, with all of that in mind, I believe the best we can do is fully defer delivery
of the exception until it's actually injected, and then apply the quirk to the
relevant GET APIs.

---
 arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++-----------------
 1 file changed, 39 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0112c515584..e000521dfc8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
 		vcpu->arch.exception.error_code = error_code;
 		vcpu->arch.exception.has_payload = has_payload;
 		vcpu->arch.exception.payload = payload;
-		if (!is_guest_mode(vcpu))
-			kvm_deliver_exception_payload(vcpu,
-						      &vcpu->arch.exception);
 		return;
 	}
 
@@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
-static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
-					       struct kvm_vcpu_events *events)
+static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu)
 {
-	struct kvm_queued_exception *ex;
-
-	process_nmi(vcpu);
-
-#ifdef CONFIG_KVM_SMM
-	if (kvm_check_request(KVM_REQ_SMI, vcpu))
-		process_smi(vcpu);
-#endif
-
 	/*
 	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
 	 * the only time there can be two queued exceptions is if there's a
@@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 	if (vcpu->arch.exception_vmexit.pending &&
 	    !vcpu->arch.exception.pending &&
 	    !vcpu->arch.exception.injected)
-		ex = &vcpu->arch.exception_vmexit;
-	else
-		ex = &vcpu->arch.exception;
+		return &vcpu->arch.exception_vmexit;
+
+	return &vcpu->arch.exception;
+}
+
+static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu)
+{
+	struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
 
 	/*
-	 * In guest mode, payload delivery should be deferred if the exception
-	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
-	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
-	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
-	 * propagate the payload and so it cannot be safely deferred.  Deliver
-	 * the payload if the capability hasn't been requested.
+	 * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver
+	 * the pending exception payload when userspace saves *any* vCPU state
+	 * that interacts with exception payloads to avoid breaking userspace.
+	 *
+	 * Architecturally, KVM must not deliver an exception payload until the
+	 * exception is actually injected, e.g. to avoid losing pending #DB
+	 * information (which VMX tracks in the VMCS), and to avoid clobbering
+	 * state if the exception is never injected for whatever reason.  But
+	 * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or
+	 * may not propagate the payload across save+restore, and so KVM can't
+	 * safely defer delivery of the payload.
 	 */
 	if (!vcpu->kvm->arch.exception_payload_enabled &&
 	    ex->pending && ex->has_payload)
 		kvm_deliver_exception_payload(vcpu, ex);
+}
+
+static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
+					       struct kvm_vcpu_events *events)
+{
+	struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
+
+	process_nmi(vcpu);
+
+#ifdef CONFIG_KVM_SMM
+	if (kvm_check_request(KVM_REQ_SMI, vcpu))
+		process_smi(vcpu);
+#endif
+
+	kvm_handle_exception_payload_quirk(vcpu);
 
 	memset(events, 0, sizeof(*events));
 
@@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
 	    vcpu->arch.guest_state_protected)
 		return -EINVAL;
 
+	kvm_handle_exception_payload_quirk(vcpu);
+
 	memset(dbgregs, 0, sizeof(*dbgregs));
 
 	BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db));
@@ -12123,6 +12137,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	if (vcpu->arch.guest_state_protected)
 		goto skip_protected_regs;
 
+	kvm_handle_exception_payload_quirk(vcpu);
+
 	kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
 	kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
 	kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES);

base-commit: 55671237401edd1ec59276b852b9361cc170915b
--