Message-ID: <aThIQzni6fC1qdgj@google.com>
Date: Tue, 9 Dec 2025 08:03:15 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Jim Mattson <jmattson@...gle.com>, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested VMRUN

On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> All accesses to the VMCB12 in the guest memory are limited to
> nested_svm_vmrun(). However, the VMCB12 remains mapped until the end of
> the function execution. Unmapping right after the consistency checks is
> possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is
> used after being unmapped.
> 
> Move all accesses to the VMCB12 into a new helper,
> nested_svm_vmrun_read_vmcb12(),  that maps the VMCB12,
> caches the needed fields, performs consistency checks, and unmaps it.
> This limits the scope of the VMCB12 mapping appropriately. It also
> slightly simplifies the cleanup path of nested_svm_vmrun().
> 
> nested_svm_vmrun_read_vmcb12() returns -1 if the consistency checks
> fail, maintaining the current behavior of skipping the instruction and
> unmapping the VMCB12 (although in the opposite order).
> 
> Signed-off-by: Yosry Ahmed <yosry.ahmed@...ux.dev>
> ---
>  arch/x86/kvm/svm/nested.c | 59 ++++++++++++++++++++++-----------------
>  1 file changed, 34 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index ddcd545ec1c3c..a48668c36a191 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1023,12 +1023,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
>  	return 0;
>  }
>  
> +static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)

"read_vmcb12"() sounds like a generic helper to read a specific field.  And if
the name is more specific, then I think we can drop the "vmrun" scoping.  To
aligh with similar functions in VMX and __nested_copy_vmcb_save_to_cache(), how
about nested_svm_copy_vmcb12_to_cache()?
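
E.g., just sketching the prototype (same signature as in the diff at the bottom
of this mail):

	static int nested_svm_copy_vmcb12_to_cache(struct kvm_vcpu *vcpu,
						   u64 vmcb12_gpa);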

> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	struct kvm_host_map map;
> +	struct vmcb *vmcb12;
> +	int ret;
> +
> +	ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> +	if (ret)
> +		return ret;
> +
> +	vmcb12 = map.hva;
> +
> +	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> +	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
> +
> +	if (!nested_vmcb_check_save(vcpu) ||
> +	    !nested_vmcb_check_controls(vcpu)) {
> +		vmcb12->control.exit_code    = SVM_EXIT_ERR;
> +		vmcb12->control.exit_code_hi = 0;
> +		vmcb12->control.exit_info_1  = 0;
> +		vmcb12->control.exit_info_2  = 0;
> +		ret = -1;

I don't love shoving the consistency checks in here.  I get why you did it, but
it's very surprising to see (and/or easy to miss) these consistency checks.  The
caller also ends up quite wonky:

	if (ret == -EINVAL) {
		kvm_inject_gp(vcpu, 0);
		return 1;
	} else if (ret) {
		return kvm_skip_emulated_instruction(vcpu);
	}

	ret = kvm_skip_emulated_instruction(vcpu);

Ha!  And it's buggy.  __kvm_vcpu_map() can return -EFAULT if creating a host
mapping fails.  Eww, and blindly using '-1' as the "failed a consistency check"
return value is equally gross, as it relies on kvm_vcpu_map() not returning
-EPERM in a very weird way.
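
To make the conflation concrete, here's a tiny standalone userspace snippet
(purely illustrative, nothing KVM-specific about it):

	#include <errno.h>
	#include <stdio.h>

	int main(void)
	{
		/* On Linux, EPERM == 1, so a bare "return -1" is bit-for-bit
		 * identical to propagating -EPERM out of kvm_vcpu_map(). */
		printf("-EPERM = %d\n", -EPERM);	/* prints -1 */
		return 0;
	}

I.e. the caller's "-1 means a consistency check failed" convention only works
because kvm_vcpu_map() happens to never return -EPERM today.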

Ugh, and there's also this nastiness in nested_vmcb_check_controls():

	/*
	 * Make sure we did not enter guest mode yet, in which case
	 * kvm_read_cr0() could return L2's CR0.
	 */
	WARN_ON_ONCE(is_guest_mode(vcpu));
	return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));

nested_vmcb_check_save() and nested_vmcb_check_controls() really shouldn't exist.
They just make it harder to see what KVM is checking in the "normal" flow.

Aha!  And I'm fairly certain there are at least two pre-existing bugs due to KVM
doing "early" consistency checks in nested_svm_vmrun().

  1. KVM doesn't clear GIF on the early #VMEXIT.  In classic APM fashion, nothing
     _requires_ GIF=0 before VMRUN:

        It is assumed that VMM software cleared GIF some time before executing
        the VMRUN instruction, to ensure an atomic state switch.

     And so VMRUN with GIF=1 that hits an "early" consistency check #VMEXIT would
     incorrectly leave GIF=1.


  2. svm_leave_smm() is missing consistency checks on the newly loaded guest state,
     because the checks aren't performed by enter_svm_guest_mode() (the RSM path is
     sketched below).  I don't see anything that would prevent vmcb12 from being
     modified by the guest between SMI and RSM.
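
To spell out the RSM path (a rough sketch of the call chain, not verbatim
kernel code):

	/*
	 * RSM emulation lands in svm_leave_smm(), which re-enters guest mode
	 * directly and bypasses nested_svm_vmrun() entirely:
	 *
	 *   svm_leave_smm()
	 *     -> enter_svm_guest_mode(vcpu, vmcb12_gpa, false /* from_vmrun */)
	 *
	 * Nothing on that path runs nested_vmcb_check_save() or
	 * nested_vmcb_check_controls(), so a vmcb12 the guest rewrote between
	 * SMI and RSM gets consumed unchecked.
	 */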

Moving the consistency checks into enter_svm_guest_mode() would solve all three
(four?) problems.  And as a bonus, nested_svm_copy_vmcb12_to_cache() can use
kvm_vcpu_map_readonly().

Compile tested only, but I think we can end up with a delta like so:

---
 arch/x86/kvm/svm/nested.c | 67 ++++++++++++---------------------------
 1 file changed, 20 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 7c86987fdaca..8a0df6c535b5 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -372,9 +372,9 @@ static bool nested_svm_check_event_inj(struct kvm_vcpu *vcpu, u32 event_inj)
 	return true;
 }
 
-static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
-					 struct vmcb_ctrl_area_cached *control,
-					 unsigned long l1_cr0)
+static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
+				       struct vmcb_ctrl_area_cached *control,
+				       unsigned long l1_cr0)
 {
 	if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
 		return false;
@@ -408,8 +408,8 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
 }
 
 /* Common checks that apply to both L1 and L2 state.  */
-static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
-				     struct vmcb_save_area_cached *save)
+static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu,
+				   struct vmcb_save_area_cached *save)
 {
 	if (CC(!(save->efer & EFER_SVME)))
 		return false;
@@ -448,27 +448,6 @@ static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
 	return true;
 }
 
-static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-	struct vmcb_save_area_cached *save = &svm->nested.save;
-
-	return __nested_vmcb_check_save(vcpu, save);
-}
-
-static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-	struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
-
-	/*
-	 * Make sure we did not enter guest mode yet, in which case
-	 * kvm_read_cr0() could return L2's CR0.
-	 */
-	WARN_ON_ONCE(is_guest_mode(vcpu));
-	return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
-}
-
 static
 void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
 					 struct vmcb_ctrl_area_cached *to,
@@ -1004,6 +983,12 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
 	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
 
 	svm_switch_vmcb(svm, &svm->nested.vmcb02);
+
+	if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
+	    !nested_vmcb_check_controls(vcpu, &svm->nested.ctl,
+					svm->vmcb01.ptr->save.cr0))
+		return -EINVAL;
+
 	nested_vmcb02_prepare_control(svm, save->rip, save->cs.base);
 	nested_vmcb02_prepare_save(svm);
 
@@ -1025,33 +1010,24 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
 	return 0;
 }
 
-static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
+static int nested_svm_copy_vmcb12_to_cache(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct kvm_host_map map;
 	struct vmcb *vmcb12;
-	int ret;
+	int r;
 
-	ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
-	if (ret)
-		return ret;
+	r = kvm_vcpu_map_readonly(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
+	if (r)
+		return r;
 
 	vmcb12 = map.hva;
 
 	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
 	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
 
-	if (!nested_vmcb_check_save(vcpu) ||
-	    !nested_vmcb_check_controls(vcpu)) {
-		vmcb12->control.exit_code    = SVM_EXIT_ERR;
-		vmcb12->control.exit_code_hi = 0;
-		vmcb12->control.exit_info_1  = 0;
-		vmcb12->control.exit_info_2  = 0;
-		ret = -1;
-	}
-
 	kvm_vcpu_unmap(vcpu, &map);
-	return ret;
+	return 0;
 }
 
 int nested_svm_vmrun(struct kvm_vcpu *vcpu)
@@ -1082,12 +1058,9 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 		return -EINVAL;
 
 	vmcb12_gpa = svm->vmcb->save.rax;
-	ret = nested_svm_vmrun_read_vmcb12(vcpu, vmcb12_gpa);
-	if (ret == -EINVAL) {
+	if (nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa)) {
 		kvm_inject_gp(vcpu, 0);
 		return 1;
-	} else if (ret) {
-		return kvm_skip_emulated_instruction(vcpu);
 	}
 
 	ret = kvm_skip_emulated_instruction(vcpu);
@@ -1919,7 +1892,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	ret = -EINVAL;
 	__nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl);
 	/* 'save' contains L1 state saved from before VMRUN */
-	if (!__nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
+	if (!nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
 		goto out_free;
 
 	/*
@@ -1938,7 +1911,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	if (!(save->cr0 & X86_CR0_PG) ||
 	    !(save->cr0 & X86_CR0_PE) ||
 	    (save->rflags & X86_EFLAGS_VM) ||
-	    !__nested_vmcb_check_save(vcpu, &save_cached))
+	    !nested_vmcb_check_save(vcpu, &save_cached))
 		goto out_free;
 
 

base-commit: 01597a665f5dcf8d7cfbedf36f4e6d46d045eb4f
--

