linux-kernel - Re: [PATCH v3 6/6] KVM: VMX: Make CR0.WP a guest owned bit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cf70127f-5e15-3f5e-08fc-884877b2421a@grsecurity.net>
Date:   Mon, 20 Mar 2023 22:31:14 +0100
From:   Mathias Krause <minipli@...ecurity.net>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [PATCH v3 6/6] KVM: VMX: Make CR0.WP a guest owned bit

On 15.03.23 23:30, Sean Christopherson wrote:
> On Wed, Feb 01, 2023, Mathias Krause wrote:
>> Guests like grsecurity that make heavy use of CR0.WP to implement kernel
>> level W^X will suffer from the implied VMEXITs.
>>
>> For a direct MMU role there is no need to intercept a guest change of
>> CR0.WP, so simply make it a guest owned bit if we can do so.
>>
>> This implies that a read of a guest's CR0.WP bit might need a VMREAD.
>> However, the only potentially affected user seems to be kvm_init_mmu()
>> which is a heavy operation to begin with. But also most callers already
>> cache the full value of CR0 anyway, so no additional VMREAD is needed.
>> The only exception is nested_vmx_load_cr3().
>>
>> Add a new module parameter 'lazycr0' to allow users to revert back to
>> the old behaviour by loading kvm-intel.ko with 'lazycr0=0'.
>>
>> This change is VMX-specific, as SVM has no such fine grained control
>> register intercept control.
>>
>> Suggested-by: Sean Christopherson <seanjc@...gle.com>
>> Signed-off-by: Mathias Krause <minipli@...ecurity.net>
>> ---
>>
>> Initially I wanted to implement the scheme Sean sketched[1]: having a
>> threshold where we would switch from eager to lazy CR0.WP tracking after
>> toggling the bit often enough, make the bit guest owned afterwards and
>> VMREAD CR0 when needed. However, when starting to look for users that
>> would be affected, I only found kvm_init_mmu() (via kvm_init_mmu() ->
>> vcpu_to_role_regs() -> kvm_read_cr0_bits(KVM_MMU_CR0_ROLE_BITS)). It has
>> only these three interesting callers:
>> 1/ kvm_mmu_reset_context(), which isn't all that interesting, as it's a
>>    heavy weight operation anyway and many of the control flows leading
>>    to it already cache the value of CR0, so no additional VMREAD is
>>    needed,
>> 2/ nested_vmx_load_cr3() and
>> 3/ kvm_post_set_cr0(), only when CR0.WP was toggled and the MMU is in
>>    direct mode (optimization introduced by patch 3).
>>
>> The last case's most interesting caller is likely kvm_set_cr0(), which
>> already caches the written CR0 value, thereby vanishes the need for
>> another VMREAD in vcpu_to_role_regs().
>>
>> That's why I went with the much simpler approach and always allow CR0.WP
>> to be guest owned if EPT is enabled as well.
> 
> Nice!
> 
>> There's nothing we can do for SVM, though :/
> 
> :/ indeed
> 
>> [1] https://lore.kernel.org/kvm/Y8cTMnyBzNdO5dY3@google.com/
>> ---
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d3b49e0b6c32..1969360d2744 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -91,6 +91,9 @@ module_param_named(flexpriority, flexpriority_enabled, bool, S_IRUGO);
>>  bool __read_mostly enable_ept = 1;
>>  module_param_named(ept, enable_ept, bool, S_IRUGO);
>>  
>> +bool __read_mostly enable_lazy_cr0 = 1;
>> +module_param_named(lazycr0, enable_lazy_cr0, bool, S_IRUGO);
> 
> Unless someone crawls out of the woodworks to object, let's omit the module param
> and make this unconditional.  We typically add module params for behavior where
> there are legitimate downsides even if KVM is bug free, or for features that are
> dependent on hardware.  E.g. testing shadow paging without a knob to disable EPT
> would require acces to really ancient CPUs.
> 
> The one exception that comes to mind is force_flush_and_sync_on_reuse, but TLB
> bugs tend to be subtle and hard to hit, whereas if we break something with CR0.WP
> emulation, the breakage should be immediate and obvious.
> 
>>  bool __read_mostly enable_unrestricted_guest = 1;
>>  module_param_named(unrestricted_guest,
>>  			enable_unrestricted_guest, bool, S_IRUGO);
>> @@ -4765,7 +4768,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>>  	/* 22.2.1, 20.8.1 */
>>  	vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
>>  
>> -	vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
>> +	vmx->vcpu.arch.cr0_guest_owned_bits = vmx_guest_owned_cr0_bits();
>>  	vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
>>  
>>  	set_cr4_guest_host_mask(vmx);
>> @@ -8370,6 +8373,10 @@ static __init int hardware_setup(void)
>>  		return -EOPNOTSUPP;
>>  	}
>>  
>> +	/* Need EPT for lazy CR0.WP synchronization. */
>> +	if (!enable_ept)
>> +		enable_lazy_cr0 = 0;
> 
> Heh, just realized that this code would be broken if nested TDP wasn't exempt
> from including CR0.WP in the MMU role.  Better to be lucky than good :-)

=:)

> 
> And similar to similar to kvm_post_set_cr0(), the CR0.PG=0 case _could_ let
> CR0.WP be guest-owned, but I don't think that's worth doing as it introduces a
> subtle dependency on CR0 being up-to-date (or passed in).

And it has no real use case, IMHO. Aside from the academic exercise, why
would any sane operating system^W^W"system software" that runs non-paged
touch CR0.WP at all?

> 
> So this?
> 
> ---
>  arch/x86/kvm/kvm_cache_regs.h |  2 +-
>  arch/x86/kvm/vmx/nested.c     |  4 ++--
>  arch/x86/kvm/vmx/vmx.c        |  2 +-
>  arch/x86/kvm/vmx/vmx.h        | 18 ++++++++++++++++++
>  4 files changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index 4c91f626c058..e50d353b5c1c 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -4,7 +4,7 @@
>  
>  #include <linux/kvm_host.h>
>  
> -#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
> +#define KVM_POSSIBLE_CR0_GUEST_BITS	(X86_CR0_TS | X86_CR0_WP)
>  #define KVM_POSSIBLE_CR4_GUEST_BITS				  \
>  	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR  \
>  	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 7c4f5ca405c7..a0c92a2b3f65 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4478,7 +4478,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>  	 * CR0_GUEST_HOST_MASK is already set in the original vmcs01
>  	 * (KVM doesn't change it);
>  	 */
> -	vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +	vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
>  	vmx_set_cr0(vcpu, vmcs12->host_cr0);
>  
>  	/* Same as above - no reason to call set_cr4_guest_host_mask().  */
> @@ -4629,7 +4629,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
>  	 */
>  	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
>  
> -	vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +	vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
>  	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
>  
>  	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index da65d90984ae..136adccffc4b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4773,7 +4773,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  	/* 22.2.1, 20.8.1 */
>  	vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
>  
> -	vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +	vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
>  	vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
>  
>  	set_cr4_guest_host_mask(vmx);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 2acdc54bc34b..423e9d3c9c40 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -640,6 +640,24 @@ BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
>  				(1 << VCPU_EXREG_EXIT_INFO_1) | \
>  				(1 << VCPU_EXREG_EXIT_INFO_2))
>  
> +static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
> +{
> +	unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +
> +	/*
> +	 * CR0.WP needs to be intercepted when KVM is shadowing legacy paging
> +	 * in order to construct shadow PTEs with the correct protections.
> +	 * Note!  CR0.WP technically can be passed through to the guest if
> +	 * paging is disabled, but checking CR0.PG would generate a cyclical
> +	 * dependency of sorts due to forcing the caller to ensure CR0 holds
> +	 * the correct value prior to determining which CR0 bits can be owned
> +	 * by L1.  Keep it simple and limit the optimization to EPT.
> +	 */
> +	if (!enable_ept)
> +		bits &= ~X86_CR0_WP;
> +	return bits;
> +}
> +
>  static __always_inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
>  {
>  	return container_of(kvm, struct kvm_vmx, kvm);
> 
> base-commit: 0b39948a802b5e76d65989b47ae36fe0dfbc10ad

LGTM!