linux-kernel - Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT

Message-ID: <aYJPiVICpb3R6Cj_@google.com>
Date: Tue, 3 Feb 2026 11:42:01 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on
 nested #VMEXIT

On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> On Tue, Feb 03, 2026 at 10:03:35AM -0800, Sean Christopherson wrote:
> > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > > On Tue, Feb 03, 2026 at 08:12:30AM -0800, Sean Christopherson wrote:
> > > > On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> > > > 		/*
> > > > 		 * If L2 is active, defer delivery of the payload until the
> > > > 		 * exception is actually injected to avoid clobbering state if
> > > > 		 * L1 wants to intercept the exception (the architectural state
> > > > 		 * is NOT updated if the exception is morphed to a VM-Exit).
> > > > 		 */
> > > 
> > > It's not only about exceptions being morphed to a VM-Exit though, is it?
> > > KVM should not update the payload (e.g. CR2) if a #PF is pending but was
> > > not injected, because from L1's perspective CR2 was updated but
> > > exit_int_info won't reflect a #PF. Right?
> > 
> > Right, but that's got nothing to do with L2 being active.  Take nested completely
> > out of the picture, and the above statement holds true as well.  "If a #PF is
> > pending but was not injected, then the guest shouldn't see a change in CR2".
> 
> Right, but it is still related to nested in a way. Ignore the exception
> morphing to a VM-Exit, the case I am referring to is specifically
> exit_int_info on SVM. IIUC, if there's an injected (but not intercepted)
> exception when doing a nested VM-Exit, we have to propagate that to L1
> (in nested_save_pending_event_to_vmcb12()), such that it can re-inject
> that exception.

Ugh, that's a poor choice of name for nested_save_pending_event_to_vmcb12().

As defined by kvm_queued_exception, that's not a *pending* event, it's an
*injected* event.  In that case, the payload *should* have been delivered (to CR2
or DR6) because that exception has already occurred (been "detected" in the SDM's
weird wording).  The VM-Exit is not happening *before* the #PF, it's happening
after the #PF is "detected", while the #PF is being vectored.

From a virtualization perspective, any other implementation is basically unworkable,
as it would require the host to gain control after an exception is successfully
vectored.  I.e. the absence of any mechanism to support that effectively confirms
that the CPU writes CR2 before attempting to deliver the exception to software.
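
To make the pending vs. injected distinction concrete, here is a minimal
standalone sketch.  It is a toy model, not KVM code: toy_vcpu, toy_exception,
toy_queue_pf() and toy_inject() are hypothetical names invented for
illustration, and the only thing borrowed from the discussion above is the
rule that the payload reaches CR2 when the exception becomes injected, not
when it is queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; none of these names exist in KVM. */
struct toy_exception {
	bool pending;		/* queued, guest must not see side effects yet */
	bool injected;		/* being vectored, payload already delivered */
	bool has_payload;
	uint64_t payload;	/* e.g. the faulting address for a #PF */
};

struct toy_vcpu {
	uint64_t cr2;
	struct toy_exception ex;
};

/* Queue a #PF: record the payload, but leave CR2 untouched. */
static void toy_queue_pf(struct toy_vcpu *v, uint64_t addr)
{
	v->ex = (struct toy_exception){
		.pending = true,
		.has_payload = true,
		.payload = addr,
	};
}

/* Commit to delivery: the payload lands in CR2, pending becomes injected. */
static void toy_inject(struct toy_vcpu *v)
{
	assert(v->ex.pending);

	if (v->ex.has_payload) {
		v->cr2 = v->ex.payload;
		v->ex.has_payload = false;
		v->ex.payload = 0;
	}
	v->ex.pending = false;
	v->ex.injected = true;
}
```

In this model, an exit taken between toy_queue_pf() and toy_inject() observes
the old CR2, while an exit taken after toy_inject() observes the new CR2
together with an injected exception, which is the pairing a nested VM-Exit
would need to report.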

> So what I was referring to is, if we write CR2 for a pending exception
> to L2, and then exit to L1, L1 would perceive a change in CR2 without an
> ongoing #PF in exit_int_info. I believe the equivalent VMX function is
> vmcs12_save_pending_event().

Also poorly named :-/

> All that to say, we should not deliver the payload of an exception to L2
> before it's actually injected.

As above, those helpers deal with exceptions that have already been injected by
KVM.

> > > It would actually be great to drop the is_guest_mode() check here but
> > > leave the call, because the ordering problem between KVM_VCPU_SET_EVENTS
> > > and KVM_SET_SREGS goes away, and I *think* we can drop the
> > > kvm_deliver_exception_payload() call in
> > > kvm_vcpu_ioctl_x86_get_vcpu_events().
> > >
> > > The only problem would be CR2 getting updated without a fault being
> > > reflected in the vmcb12's exit_int_info AFAICT.
> > 
> > No, that particular case is a non-issue, because the code immediately above has
> > already verified that KVM will *not* morph the #PF to a nested VM-Exit.  Note,
> > the "queue:" path is just for non-contributory exceptions and doesn't change the
> > VM-Exit change anyways.
> 
> What I meant was not stuffing the #PF into the VMCB/VMCS because it's
> intercepted, but the #PF being stuffed into exit_int_info or
> idt_vectoring_info.
> 
> If we drop the guest mode check here, we could end up with CR2 updated
> and a #PF not reflected in exit_int_info/idt_vectoring_info (assuming
> #PF is not intercepted).

No, because once {svm,vmx}_inject_exception() has been reached, KVM has fully
committed to delivering the exception to the guest.  If KVM cancels KVM_RUN, e.g.
because of a pending signal from userspace to initiate save/restore, KVM calls
kvm_x86_ops.cancel_injection() so that vendor code can move the to-be-injected
exception from the VMCS/VMCB back to vcpu->arch.exception.  Note that
kvm_requeue_exception() (a) sets injected=true and (b) deliberately doesn't
track any payload, because the payload has already been delivered.
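
The cancel path described above can be sketched in the same toy style.  This
is an assumption-laden illustration, not the real code: toy_hw_slot stands in
for the VMCB/VMCS injection field, and toy_inject() / toy_cancel_injection()
are hypothetical helpers that only mirror the described behavior (requeue
with injected=true and no payload).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; none of these names exist in KVM. */
struct toy_queued {
	bool pending, injected, has_payload;
	uint8_t vector;
	uint64_t payload;
};

/* Stand-in for the VMCB/VMCS field that holds the to-be-injected event. */
struct toy_hw_slot {
	bool valid;
	uint8_t vector;
};

struct toy_vcpu {
	uint64_t cr2;
	struct toy_queued ex;
	struct toy_hw_slot slot;
};

/* Deliver the payload and stage the event for hardware injection. */
static void toy_inject(struct toy_vcpu *v)
{
	assert(v->ex.pending);

	if (v->ex.has_payload)
		v->cr2 = v->ex.payload;
	v->slot = (struct toy_hw_slot){ .valid = true, .vector = v->ex.vector };
	v->ex = (struct toy_queued){ 0 };
}

/* The run was cancelled: pull the event back, marked injected, no payload. */
static void toy_cancel_injection(struct toy_vcpu *v)
{
	if (!v->slot.valid)
		return;

	v->ex = (struct toy_queued){ .injected = true, .vector = v->slot.vector };
	v->slot.valid = false;
}
```

The point of the sketch is that the requeued exception carries no payload
because CR2 was already written at injection time, so a save/restore taken
after the cancel sees a consistent CR2 and an injected #PF.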

If VM-Enter is executed and a non-nested VM-Exit occurs, then hardware saves the
in-progress exception in VMCB.exit_int_info/VMCS.idt_vectoring_info, and KVM
moves the exception back to vcpu->arch.exception via vmx_complete_interrupts()
and svm_complete_interrupts() (which are also used for cancelling injection,
because the logic is identical, only the VMCS/VMCB source differs).

For nested VM-Exit, KVM needs to emulate that behavior.  The exception has already
been "detected" by KVM, and the payload has already been delivered, but a VM-Exit
was encountered while vectoring the exception to software.

E.g. if a guest #PF occurs while the guest stack is at the bottom of a page, such
that the first N pushes will hit page X, and the last M pushes will hit page X-1,
and the write to page X-1 hits a #NPF / EPT Violation, then L1 will (and should!)
see an updated CR2, with the first N pushes to vector the exception resident in
page X.
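
The N/M split in that example is plain address arithmetic.  A small
standalone sketch, assuming 4KiB pages, 8-byte pushes, and a frame that fits
in two pages; toy_split_pushes() is a hypothetical helper, and "rsp" is the
stack pointer just before the first push (after any alignment the CPU
performs):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE	4096ull
#define TOY_PUSH_SIZE	8ull

struct toy_split {
	unsigned int first_page;	/* pushes that land on page X (N) */
	unsigned int lower_page;	/* pushes that land on page X-1 (M) */
};

/*
 * The stack grows down, so the first push writes at rsp - 8.  Count how
 * many pushes fit on that page before the stack crosses into the page below.
 */
static struct toy_split toy_split_pushes(uint64_t rsp, unsigned int nr_pushes)
{
	struct toy_split s = { 0, 0 };
	uint64_t first_slot, room;

	assert(rsp % TOY_PUSH_SIZE == 0);
	if (!nr_pushes)
		return s;

	first_slot = rsp - TOY_PUSH_SIZE;
	room = (first_slot % TOY_PAGE_SIZE) / TOY_PUSH_SIZE + 1;

	s.first_page = nr_pushes < room ? nr_pushes : (unsigned int)room;
	s.lower_page = nr_pushes - s.first_page;
	return s;
}
```

For example, with rsp = 0x70001020 and a six-entry frame, four pushes land on
the page containing 0x70001018 and two land on the page below it.  If the
write to the lower page faults, the four earlier writes have already
happened.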

> > So, with all of that in mind, I believe the best we can do is fully defer delivery
> > of the exception until it's actually injected, and then apply the quirk to the
> > relevant GET APIs.
> 
> I think this should work. I can test it for the nested case, the way I
> could reproduce the problem (with a VMM that does KVM_GET_SREGS before
> KVM_GET_VCPU_EVENTS, but does not use KVM_CAP_EXCEPTION_PAYLOAD) is by
> intercepting and re-injecting all #PFs from L2, and then repeatedly
> doing save+restore while L2 is doing some heavy lifting (building GCC).
> This generates a lot of #PF exceptions to be saved+restored, and we
> eventually get a segfault because of corrupted CR2 in L2.
> 
> Removing the is_guest_mode() check in kvm_multiple_exception() fixes it
> by prematurely delivering the payload when it's queued. I think your fix
> will also work by prematurely delivering the payload at save time. This
> is actually more correct because at restore time the exception will
> become injected and treated as such (e.g. shows up in exit_int_info).
> 
> Do you intend to send a patch? Or should I send it out (separate from
> the current one) with you as the author?

I'll send a patch for this, there's a lot of historical information I want to
capture.

Can you send a v2 for _this_ patch, without the comment change?