lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Znkuh/+oeDeIY68f@yzhao56-desk.sh.intel.com>
Date: Mon, 24 Jun 2024 16:29:59 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "Gao, Chao" <chao.gao@...el.com>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"Huang, Kai" <kai.huang@...el.com>, "sagis@...gle.com" <sagis@...gle.com>,
	"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Aktas, Erdem"
	<erdemaktas@...gle.com>, "dmatlack@...gle.com" <dmatlack@...gle.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Yamahata, Isaku"
	<isaku.yamahata@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>
Subject: Re: [PATCH v3 17/17] KVM: x86/tdp_mmu: Take root types for
 kvm_tdp_mmu_invalidate_all_roots()

On Sat, Jun 22, 2024 at 03:08:22AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2024-06-21 at 15:10 +0800, Yan Zhao wrote:
> > On Wed, Jun 19, 2024 at 03:36:14PM -0700, Rick Edgecombe wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 630e6b6d4bf2..a1ab67a4f41f 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> > >          * for zapping and thus puts the TDP MMU's reference to each root,
> > > i.e.
> > >          * ultimately frees all roots.
> > >          */
> > > -       kvm_tdp_mmu_invalidate_all_roots(kvm);
> > > +       kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> > all roots (mirror + direct) are invalidated here.
> 
> Right.
> > 
> > >         kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > kvm_tdp_mmu_zap_invalidated_roots() will zap invalidated mirror root with
> > mmu_lock held for read, which should trigger KVM_BUG_ON() in
> > __tdp_mmu_set_spte_atomic(), which assumes "atomic zapping don't operate on
> > mirror roots".
> > 
> > But up to now, the KVM_BUG_ON() is not triggered because
> > kvm_mmu_notifier_release() is called earlier than kvm_destroy_vm() (as in
> > below
> > call trace), and kvm_arch_flush_shadow_all() in kvm_mmu_notifier_release() has
> > zapped all mirror SPTEs before kvm_mmu_uninit_vm() called in kvm_destroy_vm().
> > 
> > 
> > kvm_mmu_notifier_release
> >   kvm_flush_shadow_all
> >     kvm_arch_flush_shadow_all
> >       static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
> >       kvm_mmu_zap_all  ==>hold mmu_lock for write
> >         kvm_tdp_mmu_zap_all ==>zap KVM_ALL_ROOTS with mmu_lock held for write
> > 
> > kvm_destroy_vm
> >   kvm_arch_destroy_vm
> >     kvm_mmu_uninit_vm
> >       kvm_mmu_uninit_tdp_mmu
> >         kvm_tdp_mmu_invalidate_roots ==>invalid all KVM_VALID_ROOTS
> >         kvm_tdp_mmu_zap_invalidated_roots ==> zap all roots with mmu_lock held
> > for read
> > 
> > 
> > A question is that kvm_mmu_notifier_release(), as a callback of primary MMU
> > notifier, why does it zap mirrored tdp when all other callbacks are with
> > KVM_FILTER_SHARED?
> > 
> > Could we just zap all KVM_DIRECT_ROOTS (valid | invalid) in
> > kvm_mmu_notifier_release() and move mirrord tdp related stuffs from 
> > kvm_arch_flush_shadow_all() to kvm_mmu_uninit_tdp_mmu(), ensuring mmu_lock is
> > held for write?
> 
> Sigh, thanks for flagging this. I agree it seems weird to free private memory
> from an MMU notifier callback. I also found this old thread where Sean NAKed the
> current approach (free hkid in mmu release):
> https://lore.kernel.org/kvm/ZN+1QHGa6ltpQxZn@google.com/#t
> 
> One challenge is that flush_shadow_all_private() needs to be done before
> kvm_destroy_vcpus(), where it gets into tdx_vcpu_free(). So kvm_mmu_uninit_vm()
> is too late. Perhaps this is why it was shoved into mmu notifier release (which
> happens long before as you noted). Isaku, do you recall any other reasons?
> 
> But static_call_cond(kvm_x86_vm_destroy) happens before kvm_destroy_vcpus, so we
> could maybe actually just do the tdx_mmu_release_hkid() part there. Then drop
> the flush_shadow_all_private x86 op. See the (not thoroughly checked) diff at
> the bottom of this mail.
It looks good to me.

> 
> But most of what is being discussed is in future patches where it starts to get
> into the TDX module interaction. So I wonder if we should drop this patch 17
> from "part 1" and include it with the next series so it can all be considered
> together.
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-
> ops.h
> index 2adf36b74910..3927731aa947 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -23,7 +23,6 @@ KVM_X86_OP(has_emulated_msr)
>  KVM_X86_OP(vcpu_after_set_cpuid)
>  KVM_X86_OP_OPTIONAL(vm_enable_cap)
>  KVM_X86_OP(vm_init)
> -KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
>  KVM_X86_OP_OPTIONAL(vm_destroy)
>  KVM_X86_OP_OPTIONAL(vm_free)
>  KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8a72e5873808..8b2b79b39d0f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1647,7 +1647,6 @@ struct kvm_x86_ops {
>         unsigned int vm_size;
>         int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
>         int (*vm_init)(struct kvm *kvm);
> -       void (*flush_shadow_all_private)(struct kvm *kvm);
>         void (*vm_destroy)(struct kvm *kvm);
>         void (*vm_free)(struct kvm *kvm);
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e1299eb03e63..4deeeac14324 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6446,7 +6446,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>          * lead to use-after-free.
>          */
>         if (tdp_mmu_enabled)
> -               kvm_tdp_mmu_zap_invalidated_roots(kvm);
> +               kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
>  }
>  
>  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> @@ -6977,13 +6977,6 @@ static void kvm_mmu_zap_all(struct kvm *kvm)
>  
>  void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  {
> -       /*
> -        * kvm_mmu_zap_all() zaps both private and shared page tables.  Before
> -        * tearing down private page tables, TDX requires some TD resources to
> -        * be destroyed (i.e. keyID must have been reclaimed, etc).  Invoke
> -        * kvm_x86_flush_shadow_all_private() for this.
> -        */
> -       static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
>         kvm_mmu_zap_all(kvm);
>  }
>  
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 68dfcdb46ab7..9e8b012aa8cc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -38,7 +38,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>          * ultimately frees all roots.
>          */
>         kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> -       kvm_tdp_mmu_zap_invalidated_roots(kvm);
> +       kvm_tdp_mmu_zap_invalidated_roots(kvm, false);
>  
>         WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
>         WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
> @@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>          * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
>          */
>         lockdep_assert_held_write(&kvm->mmu_lock);
> -       for_each_tdp_mmu_root_yield_safe(kvm, root)
> +       __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
nit: update the comment of kvm_tdp_mmu_zap_all() and explain why it's
KVM_DIRECT_ROOTS, not KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS.

>                 tdp_mmu_zap_root(kvm, root, false);
>  }
>  
> @@ -1065,11 +1065,14 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>   * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
>   * zap" completes.
>   */
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared)
>  {
>         struct kvm_mmu_page *root;
>  
> -       read_lock(&kvm->mmu_lock);
> +       if (shared)
> +               read_lock(&kvm->mmu_lock);
> +       else
> +               write_lock(&kvm->mmu_lock);
>  
>         for_each_tdp_mmu_root_yield_safe(kvm, root) {
>                 if (!root->tdp_mmu_scheduled_root_to_zap)
> @@ -1087,7 +1090,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>                  * that may be zapped, as such entries are associated with the
>                  * ASID on both VMX and SVM.
>                  */
> -               tdp_mmu_zap_root(kvm, root, true);
> +               tdp_mmu_zap_root(kvm, root, shared);
>  
>                 /*
>                  * The referenced needs to be put *after* zapping the root, as
> @@ -1097,7 +1100,10 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>                 kvm_tdp_mmu_put_root(kvm, root);
>         }
>  
> -       read_unlock(&kvm->mmu_lock);
> +       if (shared)
> +               read_unlock(&kvm->mmu_lock);
> +       else
> +               write_unlock(&kvm->mmu_lock);
>  }
>  
>  /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 56741d31048a..7927fa4a96e0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -68,7 +68,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page
> *sp);
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
>                                   enum kvm_tdp_mmu_root_types root_types);
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
>  
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>  
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index b6828e35eb17..3f9bfcd3e152 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -98,16 +98,12 @@ static int vt_vm_init(struct kvm *kvm)
>         return vmx_vm_init(kvm);
>  }
>  
> -static void vt_flush_shadow_all_private(struct kvm *kvm)
> -{
> -       if (is_td(kvm))
> -               tdx_mmu_release_hkid(kvm);
> -}
> -
>  static void vt_vm_destroy(struct kvm *kvm)
>  {
> -       if (is_td(kvm))
> +       if (is_td(kvm)) {
> +               tdx_mmu_release_hkid(kvm);
>                 return;
> +       }
>  
>         vmx_vm_destroy(kvm);
>  }
> @@ -980,7 +976,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>         .vm_size = sizeof(struct kvm_vmx),
>         .vm_enable_cap = vt_vm_enable_cap,
>         .vm_init = vt_vm_init,
> -       .flush_shadow_all_private = vt_flush_shadow_all_private,
>         .vm_destroy = vt_vm_destroy,
>         .vm_free = vt_vm_free,
>  
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ