linux-kernel - Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYIebtv3nNnsqUiZ@google.com>
Date: Tue, 3 Feb 2026 08:12:30 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on
 nested #VMEXIT

On Tue, Feb 03, 2026, Yosry Ahmed wrote:
> KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on
> nested #VMEXIT. Use the value from vcpu->arch.cr2 instead.
> 
> The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN
> of L2, and sync'd back to vcpu->arch.cr2 shortly after. The value are
> only out-of-sync in two cases: after migration, and after a #PF is

Nit, instead of "migration", talk about state save/restore, and then list live
migration as an example.  Most of the time it's fairly obvious that "migration"
in KVM means "live migration", but not always.  E.g. task migration often comes
into play, as does page migration.

And more importantly, the above statement is wrong when state is saved/restored
for something other than migration.  E.g. if userspace is restoring a snapshot
for reasons other than migrating a VM.

> injected into L2.
> 
> After migration, the value of CR2 in vmcb02 is uninitialized (i.e.

save/restore

> zero),

This isn't guaranteed.  E.g. if state is restored into a vCPU that was already
running, vmcb02.save.cr2 could hold any number of things.

> as KVM_SET_SREGS restores CR2 value to vcpu->arch.cr2. Using
> vcpu->arch.cr2 to update vmcb12 is the right thing to do.
> 
> The #PF injection case is more nuanced. It occurs if KVM injects a #PF
> into L2, then exits to L1 before it actually runs L2. Although the APM
> is a bit unclear about when CR2 is written during a #PF, the SDM is more
> clear:
> 
> 	Processors update CR2 whenever a page fault is detected. If a
> 	second page fault occurs while an earlier page fault is being
> 	delivered, the faulting linear address of the second fault will
> 	overwrite the contents of CR2 (replacing the previous address).
> 	These updates to CR2 occur even if the page fault results in a
> 	double fault or occurs during the delivery of a double fault.
> 
> KVM injecting the exception surely counts as the #PF being "detected".

Heh, "detected" is definitely poor wording in the SDM.

> More importantly, when an exception is injected into L2 at the time of a
> synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly,
> such that an L1 hypervisor can re-inject the exception. If CR2 is not
> written at that point, the L1 hypervisor have no way of correctly
> re-injecting the #PF. Hence, using vcpu->arch.cr2 is also the right
> thing to write in vmcb12 in this case.
> 
> Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for
> L2, only when it is injected. The distinction is important, because only
> injected exceptions are propagated to L1 through exit_int_info. It would
> be incorrect to update CR2 in vmcb12 for a pending #PF, as L1 would
> perceive an updated CR2 value with no #PF. Update the comment in
> kvm_deliver_exception_payload() to clarify this.
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@...ux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 2 +-
>  arch/x86/kvm/x86.c        | 7 +++++++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index de90b104a0dd5..9031746ce2db1 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1156,7 +1156,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>  	vmcb12->save.efer   = svm->vcpu.arch.efer;
>  	vmcb12->save.cr0    = kvm_read_cr0(vcpu);
>  	vmcb12->save.cr3    = kvm_read_cr3(vcpu);
> -	vmcb12->save.cr2    = vmcb02->save.cr2;
> +	vmcb12->save.cr2    = vcpu->arch.cr2;
>  	vmcb12->save.cr4    = svm->vcpu.arch.cr4;
>  	vmcb12->save.rflags = kvm_get_rflags(vcpu);
>  	vmcb12->save.rip    = kvm_rip_read(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index db3f393192d94..1015522d0fbd7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -864,6 +864,13 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
>  		vcpu->arch.exception.error_code = error_code;
>  		vcpu->arch.exception.has_payload = has_payload;
>  		vcpu->arch.exception.payload = payload;
> +		/*
> +		 * Only injected exceptions are propagated to L1 in
> +		 * vmcb12/vmcs12 on nested #VMEXIT. Hence, do not deliver the

Nit, #VMEXIT is SVM specific terminology.  VM-Exit is more vendor agnostic.

> +		 * exception payload for L2 until the exception is injected.
> +		 * Otherwise, L1 would perceive the updated payload without a
> +		 * corresponding exception.

Huh.  I'm fairly certain this code is now at best unnecessary, and at worst
actively harmful.  Because the more architectural way to document this code is:

		/*
		 * If L2 is active, defer delivery of the payload until the
		 * exception is actually injected to avoid clobbering state if
		 * L1 wants to intercept the exception (the architectural state
		 * is NOT updated if the exeption is morphed to a VM-Exit).
		 */

But thanks to commit 7709aba8f716 ("KVM: x86: Morph pending exceptions to pending
VM-Exits at queue time"), KVM already *knows* the exception won't be morphed to a
VM-Exit.

Ugh, and I'm pretty sure I botched kvm_vcpu_ioctl_x86_get_vcpu_events() in that
commit.  Because invoking kvm_deliver_exception_payload() when the exception was
morphed to a VM-Exit is wrong.  Oh, wait, this is the !exception_payload_enabled
case.  So never mind, that's simply an unfixable bug, as the second comment alludes
to.

	/*
	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
	 * the only time there can be two queued exceptions is if there's a
	 * non-exiting _injected_ exception, and a pending exiting exception.
	 * In that case, ignore the VM-Exiting exception as it's an extension
	 * of the injected exception.
	 */
	if (vcpu->arch.exception_vmexit.pending &&
	    !vcpu->arch.exception.pending &&
	    !vcpu->arch.exception.injected)
		ex = &vcpu->arch.exception_vmexit;
	else
		ex = &vcpu->arch.exception;

	/*
	 * In guest mode, payload delivery should be deferred if the exception
	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
	 * propagate the payload and so it cannot be safely deferred.  Deliver
	 * the payload if the capability hasn't been requested.
	 */
	if (!vcpu->kvm->arch.exception_payload_enabled &&
	    ex->pending && ex->has_payload)
		kvm_deliver_exception_payload(vcpu, ex);

So yeah, I _think_ we could drop the is_guest_mode() check.  However, even better
would be to drop this call *entirely*, i.e.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0112c515584..00a39c95631d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
                vcpu->arch.exception.error_code = error_code;
                vcpu->arch.exception.has_payload = has_payload;
                vcpu->arch.exception.payload = payload;
-               if (!is_guest_mode(vcpu))
-                       kvm_deliver_exception_payload(vcpu,
-                                                     &vcpu->arch.exception);
                return;
        }
 
Because KVM really shouldn't update CR2 until the excpetion is actually injected
(or the state is at risk of being lost because exception_payload_enabled==false).
E.g. in the (extremely) unlikely (and probably impossible to configure reliably)
scenario that userspace deliberately drops a pending exception, arch state shouldn't
be updated.

> +		 */
>  		if (!is_guest_mode(vcpu))
>  			kvm_deliver_exception_payload(vcpu,
>  						      &vcpu->arch.exception);
> -- 
> 2.53.0.rc2.204.g2597b5adb4-goog
>