Message-ID: <nbkpibgkill4hyuphsju7id5v73lufmas5sammpj6umvhzd25t@y6dkgguq2cuy>
Date: Tue, 9 Dec 2025 18:24:46 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Jim Mattson <jmattson@...gle.com>, 
	kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 10/13] KVM: nSVM: Restrict mapping VMCB12 on nested
 VMRUN

On Tue, Dec 09, 2025 at 08:03:15AM -0800, Sean Christopherson wrote:
> On Mon, Nov 10, 2025, Yosry Ahmed wrote:
> > All accesses to the VMCB12 in the guest memory are limited to
> > nested_svm_vmrun(). However, the VMCB12 remains mapped until the end of
> > the function execution. Unmapping right after the consistency checks is
> > possible, but it becomes easy-ish to introduce bugs where 'vmcb12' is
> > used after being unmapped.
> > 
> > Move all accesses to the VMCB12 into a new helper,
> > nested_svm_vmrun_read_vmcb12(), that maps the VMCB12,
> > caches the needed fields, performs consistency checks, and unmaps it.
> > This limits the scope of the VMCB12 mapping appropriately. It also
> > slightly simplifies the cleanup path of nested_svm_vmrun().
> > 
> > nested_svm_vmrun_read_vmcb12() returns -1 if the consistency checks
> > fail, maintaining the current behavior of skipping the instruction and
> > unmapping the VMCB12 (although in the opposite order).
> > 
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@...ux.dev>
> > ---
> >  arch/x86/kvm/svm/nested.c | 59 ++++++++++++++++++++++-----------------
> >  1 file changed, 34 insertions(+), 25 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index ddcd545ec1c3c..a48668c36a191 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1023,12 +1023,39 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
> >  	return 0;
> >  }
> >  
> > +static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
> 
> "read_vmcb12"() sounds like a generic helper to read a specific field.  And if
> the name is more specific, then I think we can drop the "vmrun" scoping.  To
> aligh with similar functions in VMX and __nested_copy_vmcb_save_to_cache(), how
> about nested_svm_copy_vmcb12_to_cache()?

nested_svm_copy_vmcb12_to_cache() sounds good.

> 
> > +{
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> > +	struct kvm_host_map map;
> > +	struct vmcb *vmcb12;
> > +	int ret;
> > +
> > +	ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> > +	if (ret)
> > +		return ret;
> > +
> > +	vmcb12 = map.hva;
> > +
> > +	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
> > +	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
> > +
> > +	if (!nested_vmcb_check_save(vcpu) ||
> > +	    !nested_vmcb_check_controls(vcpu)) {
> > +		vmcb12->control.exit_code    = SVM_EXIT_ERR;
> > +		vmcb12->control.exit_code_hi = 0;
> > +		vmcb12->control.exit_info_1  = 0;
> > +		vmcb12->control.exit_info_2  = 0;
> > +		ret = -1;
> 
> I don't love shoving the consistency checks in here.  I get why you did it, but
> it's very surprising to see (and/or easy to miss) these consistency checks.  The
> caller also ends up quite wonky:
> 
> 	if (ret == -EINVAL) {
> 		kvm_inject_gp(vcpu, 0);
> 		return 1;
> 	} else if (ret) {
> 		return kvm_skip_emulated_instruction(vcpu);
> 	}
> 
> 	ret = kvm_skip_emulated_instruction(vcpu);
> 
> Ha!  And it's buggy.  __kvm_vcpu_map() can return -EFAULT if creating a host
> mapping fails.  Eww, and blindly using '-1' as the "failed a consistency check"
> return value is equally gross, as it relies on kvm_vcpu_map() not returning
> -EPERM in a very weird way.

I was trying to maintain the pre-existing behavior as much as possible,
and I think the existing code will handle -EFAULT from kvm_vcpu_map() in
the same way (skip the instruction and return).

I guess I shouldn't have assumed maintaining the existing behavior is
the right thing to do.

It's honestly really hard to disentangle the return values of different
KVM functions and what they mean. "return 1" here is not very
meaningful, and the return code from kvm_skip_emulated_instruction() is
not documented, so I don't really know what we're supposed to return in
which cases. The error codes are usually not interpreted until a few
layers further up the call stack.
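
For context, the convention I'm aware of is that a positive return from
an exit handler like nested_svm_vmrun() means "keep running the guest",
zero means "exit to userspace", and a negative value is an error that
gets propagated out. A rough sketch (hypothetical names, not actual KVM
code) of the vcpu_run()-style loop that consumes these values:

	for (;;) {
		ret = run_exit_handler(vcpu);	/* e.g. nested_svm_vmrun() */
		if (ret <= 0)
			return ret;	/* 0: exit to userspace, <0: error */
		/* ret > 0: resume the guest */
	}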

I agree that returning -1 is not great, but in this case the caller (and
the existing code) only cared about differentiating -EINVAL from other
errors, and I found other KVM functions returning -1, so I thought I
shouldn't overthink the return value. That's also why I preferred to
leave things as-is as much as possible. But yeah, you're right, no more
-1's :)
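
For reference, a bare -1 really is -EPERM, since EPERM is defined as 1
in errno-base.h; a trivial userspace check (purely illustrative, not
KVM code):

	#include <errno.h>
	#include <stdio.h>

	int main(void)
	{
		/* EPERM == 1 on Linux, so -1 and -EPERM collide. */
		printf("-EPERM == %d\n", -EPERM);	/* prints -1 */
		return 0;
	}

which is exactly why relying on kvm_vcpu_map() never returning -EPERM
is fragile.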

> 
> Ugh, and there's also this nastiness in nested_vmcb_check_controls():
> 
> 	/*
> 	 * Make sure we did not enter guest mode yet, in which case
> 	 * kvm_read_cr0() could return L2's CR0.
> 	 */
> 	WARN_ON_ONCE(is_guest_mode(vcpu));
> 	return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
> 
> nested_vmcb_check_save() and nested_vmcb_check_controls() really shouldn't exist.
> They just make it harder to see what KVM is checking in the "normal" flow.
> 
> Aha!  And I'm fairly certain there are at least two pre-existing bugs due to KVM
> doing "early" consistency checks in nested_svm_vmrun().
> 
>   1. KVM doesn't clear GIF on the early #VMEXIT.  In classic APM fashion, nothing
>      _requires_ GIF=0 before VMRUN:
> 
>         It is assumed that VMM software cleared GIF some time before executing
>         the VMRUN instruction, to ensure an atomic state switch.
> 
>      And so VMRUN with GIF=1 that hits an "early" consistency check #VMEXIT would
>      incorrectly leave GIF=1.
> 
> 
>   2. svm_leave_smm() is missing consistency checks on the newly loaded guest state,
>      because the checks aren't performed by enter_svm_guest_mode().  I don't see
>      anything that would prevent vmcb12 from being modified by the guest between
>      SMI and RSM.
> 
> Moving the consistency checks into enter_svm_guest_mode() would solve all three
> (four?) problems.  And as a bonus, nested_svm_copy_vmcb12_to_cache() can use
> kvm_vcpu_map_readonly().

Anyway, I will move the consistency checks as you suggested; I agree
that's much better. Thanks!

> 
> Compile tested only, but I think we can end up with a delta like so:
> 
> ---
>  arch/x86/kvm/svm/nested.c | 67 ++++++++++++---------------------------
>  1 file changed, 20 insertions(+), 47 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 7c86987fdaca..8a0df6c535b5 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -372,9 +372,9 @@ static bool nested_svm_check_event_inj(struct kvm_vcpu *vcpu, u32 event_inj)
>  	return true;
>  }
>  
> -static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> -					 struct vmcb_ctrl_area_cached *control,
> -					 unsigned long l1_cr0)
> +static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
> +				       struct vmcb_ctrl_area_cached *control,
> +				       unsigned long l1_cr0)
>  {
>  	if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
>  		return false;
> @@ -408,8 +408,8 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
>  }
>  
>  /* Common checks that apply to both L1 and L2 state.  */
> -static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> -				     struct vmcb_save_area_cached *save)
> +static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu,
> +				   struct vmcb_save_area_cached *save)
>  {
>  	if (CC(!(save->efer & EFER_SVME)))
>  		return false;
> @@ -448,27 +448,6 @@ static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
>  	return true;
>  }
>  
> -static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu)
> -{
> -	struct vcpu_svm *svm = to_svm(vcpu);
> -	struct vmcb_save_area_cached *save = &svm->nested.save;
> -
> -	return __nested_vmcb_check_save(vcpu, save);
> -}
> -
> -static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
> -{
> -	struct vcpu_svm *svm = to_svm(vcpu);
> -	struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
> -
> -	/*
> -	 * Make sure we did not enter guest mode yet, in which case
> -	 * kvm_read_cr0() could return L2's CR0.
> -	 */
> -	WARN_ON_ONCE(is_guest_mode(vcpu));
> -	return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
> -}
> -
>  static
>  void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
>  					 struct vmcb_ctrl_area_cached *to,
> @@ -1004,6 +983,12 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
>  	nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
>  
>  	svm_switch_vmcb(svm, &svm->nested.vmcb02);
> +
> +	if (!nested_vmcb_check_save(vcpu, &svm->nested.save) ||
> +	    !nested_vmcb_check_controls(vcpu, &svm->nested.ctl,
> +					svm->vmcb01.ptr->save.cr0))
> +		return -EINVAL;
> +
>  	nested_vmcb02_prepare_control(svm, save->rip, save->cs.base);
>  	nested_vmcb02_prepare_save(svm);
>  
> @@ -1025,33 +1010,24 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa, bool from_vmrun)
>  	return 0;
>  }
>  
> -static int nested_svm_vmrun_read_vmcb12(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
> +static int nested_svm_copy_vmcb12_to_cache(struct kvm_vcpu *vcpu, u64 vmcb12_gpa)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  	struct kvm_host_map map;
>  	struct vmcb *vmcb12;
> -	int ret;
> +	int r;
>  
> -	ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> -	if (ret)
> -		return ret;
> +	r = kvm_vcpu_map_readonly(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
> +	if (r)
> +		return r;
>  
>  	vmcb12 = map.hva;
>  
>  	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
>  	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
>  
> -	if (!nested_vmcb_check_save(vcpu) ||
> -	    !nested_vmcb_check_controls(vcpu)) {
> -		vmcb12->control.exit_code    = SVM_EXIT_ERR;
> -		vmcb12->control.exit_code_hi = 0;
> -		vmcb12->control.exit_info_1  = 0;
> -		vmcb12->control.exit_info_2  = 0;
> -		ret = -1;
> -	}
> -
>  	kvm_vcpu_unmap(vcpu, &map);
> -	return ret;
> +	return 0;
>  }
>  
>  int nested_svm_vmrun(struct kvm_vcpu *vcpu)
> @@ -1082,12 +1058,9 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
>  		return -EINVAL;
>  
>  	vmcb12_gpa = svm->vmcb->save.rax;
> -	ret = nested_svm_vmrun_read_vmcb12(vcpu, vmcb12_gpa);
> -	if (ret == -EINVAL) {
> +	if (nested_svm_copy_vmcb12_to_cache(vcpu, vmcb12_gpa)) {
>  		kvm_inject_gp(vcpu, 0);
>  		return 1;
> -	} else if (ret) {
> -		return kvm_skip_emulated_instruction(vcpu);
>  	}
>  
>  	ret = kvm_skip_emulated_instruction(vcpu);
> @@ -1919,7 +1892,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>  	ret = -EINVAL;
>  	__nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl);
>  	/* 'save' contains L1 state saved from before VMRUN */
> -	if (!__nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
> +	if (!nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
>  		goto out_free;
>  
>  	/*
> @@ -1938,7 +1911,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>  	if (!(save->cr0 & X86_CR0_PG) ||
>  	    !(save->cr0 & X86_CR0_PE) ||
>  	    (save->rflags & X86_EFLAGS_VM) ||
> -	    !__nested_vmcb_check_save(vcpu, &save_cached))
> +	    !nested_vmcb_check_save(vcpu, &save_cached))
>  		goto out_free;
>  
>  
> 
> base-commit: 01597a665f5dcf8d7cfbedf36f4e6d46d045eb4f
> --
> 
