linux-kernel - Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <fybuz7v4zng2z5xktgqjluweplxx2hrn4j5ar57ea42e4fjwm6@5aemjmv6p4fv>
Date: Tue, 3 Feb 2026 19:54:02 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on
 nested #VMEXIT

On Tue, Feb 03, 2026 at 11:42:01AM -0800, Sean Christopherson wrote:
> On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > On Tue, Feb 03, 2026 at 10:03:35AM -0800, Sean Christopherson wrote:
> > > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > > > On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote:
> > > > > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > > > > 		/*
> > > > > 		 * If L2 is active, defer delivery of the payload until the
> > > > > 		 * exception is actually injected to avoid clobbering state if
> > > > > 		 * L1 wants to intercept the exception (the architectural state
> > > > > 		 * is NOT updated if the exeption is morphed to a VM-Exit).
> > > > > 		 */
> > > > 
> > > > It's not only about exceptions being morphed to a VM-Exit though, is it?
> > > > KVM should not update the payload (e.g. CR2) if a #PF is pending but was
> > > > not injected, because from L1's perspective CR2 was updated but
> > > > exit_int_info won't reflect a #PF. Right?
> > > 
> > > Right, but that's got nothing to do with L2 being active.  Take nested completely
> > > out of the picture, and the above statement holds true as well.  "If a #PF is
> > > pending but was not injected, then the guest shouldn't see a change in CR2".
> > 
> > Right, but it is still related to nested in a way. Ignore the exception
> > morphing to a VM-Exit, the case I am refering to is specifically
> > exit_int_info on SVM. IIUC, if there's an injected (but not intercepted)
> > exception when doing a nested VM-Exit, we have to propagate that to L1
> > (in nested_save_pending_event_to_vmcb12()), such that it can re-inject
> > that exception.
> 
> Ugh, that's a poor choice of name for nested_save_pending_event_to_vmcb12().
> 
> As defined by kvm_queued_exception, that's not a *pending* event, it's an
> *injected* event.  In that case, the payload *should* have been delivered (to CR2
> or DR6) because that exception has already occurred (been "detected" in the SDM's
> weird wording).  The VM-Exit is not happening *before* the #PF, it's happening
> after the #PF is "detected", while the #PF is being vectored.
> 
> From a virtualization perspective, any other implementation is basically unworkable,
> as it would require the host to gain control after an exception is successfully
> vectored.  I.e. the absense of any mechanisms to support that effectively confirms
> that the CPU writes CR2 before attempting to deliver the exception to software.
> 
> > So what I was referring to is, if we write CR2 for a pending exception
> > to L2, and then exit to L1, L1 would perceive a chance in CR2 without an
> > ongoing #PF in exit_int_info. I believe the equivalent VMX function is
> > vmcs12_save_pending_event().
> 
> Also poorly named :-/
> 
> > All that to say, we should not deliver the payload of an exception to L2
> > before it's actually injected.
> 
> As above, those helpers deal with exceptions that have already been injected by
> KVM.
> 
> > > > It would actually be great to drop the is_guest_mode() check here but
> > > > leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS
> > > > and KVM_SET_SREGS goes away, and I *think* we can drop the
> > > > kvm_deliver_exception_payload() call in
> > > > kvm_vcpu_ioctl_x86_get_vcpu_events().
> > > >
> > > > The only problem would be CR2 getting updated without a fault being
> > > > reflected in the vmcb12's exit_int_info AFAICT.
> > > 
> > > No, that particular case is a non-issue, because the code immediately above has
> > > already verified that KVM will *not* morph the #PF to a nested VM-Exit.  Note,
> > > the "queue:" path is just for non-contributory exceptions and doesn't change the
> > > VM-Exit change anyways.
> > 
> > What I meant was not stuffing the #PF into the VMCB/VMCS because it's
> > intercepted, but the #PF being stuffed into exit_int_info or
> > idt_vectoring_info.
> > 
> > If we drop the guest mode check here, we could end up with CR2 updated
> > and a #PF not reflected in exit_int_info/idt_vectoring_info (assuming
> > #PF is not intercepted).
> 
> No, because once {svm,vmx}_inject_exception() have been reach, KVM has fully
> committed to delivering the exception to the guest.  If KVM cancels KVM_RUN, e.g.
> because of a pending signal from userspace to initiate save/restore, KVM calls
> kvm_x86_ops.cancel_injection() so that vendor code can move the to-be-injected
> exception from the VMCS/VMCB back to vcpu->arch.exception.  Note that
> kvm_requeue_exception() (a) sets injected=true and (b) deliberately doesn't
> track any payload, because the payload has already been delivered.
> 
> If VM-Enter is executed and a non-nested VM-Exit occurs, then hardware saves the
> in-progress exception in VMCB.exit_int_info/VMCS.idt_vectoring_info, and KVM
> moves the exception back to vcpu->arch.exception via vmx_complete_interrupts()
> and svm_complete_interrupts() (which are also used for cancelling injection,
> because the logic is identical, only the VMCS/VMCB source differs).
> 
> For nested VM-Exit, KVM needs to emulate that behavior.  The exception has already
> been "detected" by KVM, and the payload has already been delivered, but a VM-Exit
> was encountered while vectoring the exception to software.
> 
> E.g. if a guest #PF occurs while the guest stack is at the bottom of a page, such
> that the first N pushes will hit page X, and the last M pushes will hit page X-1,
> and the write to page X-1 hits a #NPF / EPT Violation, then L1 will (and should!)
> see an updated CR2, with the first N pushes to vector the exception resident in
> page X.

We are not disagreeing, I think I may not have been clear. I agree with
all what you said.

What I was saying is basically delivering the payload for *pending*
exceptions for L2 is wrong (i.e. just removing the is_guest_mode() check
before calling kvm_deliver_exception_payload() in
kvm_multiple_exception()).

Exactly because of everything you said, the exception is not yet
*injected* and is *not* reflected to L1 in
exit_int_info/idt_vectoring_info if it's injection fails.

i.e what I said above:

	 If we drop the guest mode check here, we could end up with CR2
	 updated and a #PF not reflected in
	 exit_int_info/idt_vectoring_info (assuming #PF is not
	 intercepted).

We already established that removing the is_guest_mode() would be wrong,
so all is good here, I am just clarifying that what I meant initially is
that it would be wrong because we would update CR2 without actually
reflecting the #PF in exit_int_info/idt_vectoring_info, not because we
could update CR2 before a #PF intercept (as you mentioned, this part is
handled).

> 
> > > So, with all of that in mind, I believe the best we can do is fully defer delivery
> > > of the exception until it's actually injected, and then apply the quirk to the
> > > relevant GET APIs.
> > 
> > I think this should work. I can test it for the nested case, the way I
> > could reproduce the problem (with a VMM that does KVM_GET_SREGS before
> > KVM_GET_VCPU_EVENTS, but does not use KVM_CAP_EXCEPTION_PAYLOAD) is by
> > intercepting and re-injecting all #PFs from L2, and then repeatedly
> > doing save+restore while L2 is doing some heavy lifting (building GCC).
> > This generates a lot of #PF exceptions to be saved+restored, and we
> > eventually get a segfault because of corrupted CR2 in L2.
> > 
> > Removing the is_guest_mode() check in kvm_multiple_exception() fixes it
> > by prematurely delivering the payload when it's queued. I think your fix
> > will also work by prematurely delivering the payload at save time. This
> > is actually more corect because at restore time the exception will
> > become injected and treated as such (e.g. shows up in exit_int_info).
> > 
> > Do you intend to send a patch? Or should I send it out (separate from
> > the current one) with you as the author?
> 
> I'll send a patch for this, there's a lot of historical information I want to
> capture.

Sounds good, I will test that it fixes the problem of losing the payload
of a nested exception when KVM_GET_SREGS preceeds KVM_GET_VCPU_EVENTS
(without KVM_CAP_EXCEPTION_PAYLOAD). Please CC me when you send it.

> 
> Can you send a v2 for _this_ patch, without the comment change?

Will do so shortly.