Message-ID: <618614fa6c62a232d95da55546137251e1847f48.camel@intel.com>
Date: Mon, 18 Mar 2024 23:46:11 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Yamahata,
Isaku" <isaku.yamahata@...el.com>
CC: "Zhang, Tina" <tina.zhang@...el.com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "Yuan, Hang" <hang.yuan@...el.com>, "Huang, Kai"
<kai.huang@...el.com>, "Chen, Bo2" <chen.bo@...el.com>, "sagis@...gle.com"
<sagis@...gle.com>, "isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
"Aktas, Erdem" <erdemaktas@...gle.com>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, "sean.j.christopherson@...el.com"
<sean.j.christopherson@...el.com>
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for
unsupported cases
On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@...el.com wrote:
> From: Sean Christopherson <sean.j.christopherson@...el.com>
>
> TDX architecturally supports only the write-back (WB) memory type for
> private memory, so a (virtualized) memory type change doesn't make
> sense for private memory. Also, page migration isn't currently
> supported for TDX. (TDX architecturally supports page migration; it's
> a KVM and kernel implementation issue.)
>
> Regarding memory type change (mtrr virtualization and lapic page
> mapping change), pages are zapped by kvm_zap_gfn_range(). On the next
> KVM page fault, the SPTE entry with a new memory type for the page is
> populated. Regarding page migration, pages are zapped by the mmu
> notifier. On the next KVM page fault, the new migrated page is
> populated. Don't zap private pages on unmapping for those two cases.
Is the migration case relevant to TDX?
>
> When deleting/moving a KVM memory slot (typically when tearing down
> the VM), zap private pages. Don't invalidate private page tables,
> i.e. zap only leaf SPTEs for a KVM MMU that has a shared bit mask.
> The existing kvm_tdp_mmu_invalidate_all_roots() depends on
> role.invalid with the read-lock of mmu_lock so that other vcpus can
> operate on the KVM MMU concurrently. It marks the root page table
> invalid and zaps SPTEs of the root page tables. The TDX module
> doesn't allow unlinking a protected root page table from the hardware
> and then allocating a new one for it, i.e. replacing a protected root
> page table. Instead, zap only leaf SPTEs for a KVM MMU with a shared
> bit mask set.
I get the part about only zapping leafs and not the root and mid-level
PTEs. But why the MTRR, lapic page and migration part? Why should those
not be zapped? Why is migration a consideration when it is not
supported?
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++----
> arch/x86/kvm/mmu/tdp_mmu.h | 5 ++--
> 3 files changed, 92 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0d6d4506ec97..30c86e858ae4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6339,7 +6339,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
> */
> if (tdp_mmu_enabled)
> - kvm_tdp_mmu_invalidate_all_roots(kvm);
> + kvm_tdp_mmu_invalidate_all_roots(kvm, true);
>
> /*
> * Notify all vcpus to reload its shadow page table and flush TLB.
> @@ -6459,7 +6459,16 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>
> if (tdp_mmu_enabled)
> - flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
> + /*
> + * zap_private = false. Zap only shared pages.
> + *
> + * kvm_zap_gfn_range() is used when MTRR or PAT memory
> + * type was changed. Later on the next kvm page fault,
> + * populate it with updated spte entry.
> + * Because only WB is supported for private pages, don't
> + * care of private pages.
> + */
> + flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush, false);
>
> if (flush)
> kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
> @@ -6905,10 +6914,56 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> kvm_mmu_zap_all(kvm);
> }
>
> +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
What about kvm_mmu_zap_memslot_leafs() instead?
> +{
> + bool flush = false;
It doesn't need to be initialized if false is passed directly into
kvm_tdp_mmu_unmap_gfn_range(). That would make the code easier to
understand.
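I.e. something like this (untested sketch, just to illustrate):

	flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false);

with "flush" left uninitialized at its declaration.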
> +
> + write_lock(&kvm->mmu_lock);
> +
> + /*
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> + if (tdp_mmu_enabled) {
> + struct kvm_gfn_range range = {
> + .slot = slot,
> + .start = slot->base_gfn,
> + .end = slot->base_gfn + slot->npages,
> + .may_block = true,
> +
> + /*
> + * This handles both private gfn and shared gfn.
> + * All private page should be zapped on memslot deletion.
> + */
> + .only_private = true,
> + .only_shared = true,
only_private and only_shared are both true? Shouldn't they both be
false? (or just unset)
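I.e. something like (sketch; assuming that leaving both unset/false
means "zap both private and shared", which is what memslot deletion
needs):

	struct kvm_gfn_range range = {
		.slot = slot,
		.start = slot->base_gfn,
		.end = slot->base_gfn + slot->npages,
		.may_block = true,
	};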
> + };
> +
> + flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush);
> + } else {
> + /* TDX supports only TDP-MMU case. */
> + WARN_ON_ONCE(1);
How about a KVM_BUG_ON() instead? If somehow this is reached, we don't
want the caller thinking the pages are zapped and then entering the
guest with pages still mapped that have gone elsewhere.
> + flush = true;
Why flush?
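I.e. for this else branch, maybe something like (untested), which
would also drop the flush = true:

	} else {
		/* TDX supports only TDP-MMU case. */
		KVM_BUG_ON(1, kvm);
	}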
> + }
> + if (flush)
> + kvm_flush_remote_tlbs(kvm);
> +
> + write_unlock(&kvm->mmu_lock);
> +}
> +
> void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot)
> {
> - kvm_mmu_zap_all_fast(kvm);
> + if (kvm_gfn_shared_mask(kvm))
There seems to be an attempt to abstract away the existence of Secure-
EPT in mmu.c, but it is not fully successful. In this case the code
checks kvm_gfn_shared_mask() to see if it needs to handle the zapping
in the way S-EPT requires. It ends up being a little confusing because
the actual check is about whether there is a shared bit. It only works
because S-EPT is the only thing that has a kvm_gfn_shared_mask().
Doing something like (kvm->arch.vm_type == KVM_X86_TDX_VM) looks
wrong, but is more honest about what we are getting up to here. I'm
not sure though, what do you think?
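I.e. something like (not necessarily better, just more explicit about
what is actually being special-cased):

	if (kvm->arch.vm_type == KVM_X86_TDX_VM)
		kvm_mmu_zap_memslot(kvm, slot);
	else
		kvm_mmu_zap_all_fast(kvm);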
> + /*
> + * Secure-EPT requires to release PTs from the leaf. The
> + * optimization to zap root PT first with child PT doesn't
> + * work.
> + */
> + kvm_mmu_zap_memslot(kvm, slot);
> + else
> + kvm_mmu_zap_all_fast(kvm);
> }
>
> void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index d47f0daf1b03..e7514a807134 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> * for zapping and thus puts the TDP MMU's reference to each root, i.e.
> * ultimately frees all roots.
> */
> - kvm_tdp_mmu_invalidate_all_roots(kvm);
> + kvm_tdp_mmu_invalidate_all_roots(kvm, false);
> kvm_tdp_mmu_zap_invalidated_roots(kvm);
>
> WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
> @@ -771,7 +771,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> * operation can cause a soft lockup.
> */
> static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> - gfn_t start, gfn_t end, bool can_yield, bool flush)
> + gfn_t start, gfn_t end, bool can_yield, bool flush,
> + bool zap_private)
> {
> struct tdp_iter iter;
>
> @@ -779,6 +780,10 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
>
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> + WARN_ON_ONCE(zap_private && !is_private_sp(root));
All the callers have zap_private as zap_private && is_private_sp(root).
What badness is it trying to uncover?
> + if (!zap_private && is_private_sp(root))
> + return false;
> +