Message-ID: <20240319235654.GC1994522@ls.amr.corp.intel.com>
Date: Tue, 19 Mar 2024 16:56:54 -0700
From: Isaku Yamahata <isaku.yamahata@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Cc: "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>,
	"Zhang, Tina" <tina.zhang@...el.com>,
	"seanjc@...gle.com" <seanjc@...gle.com>,
	"Yuan, Hang" <hang.yuan@...el.com>,
	"Huang, Kai" <kai.huang@...el.com>, "Chen, Bo2" <chen.bo@...el.com>,
	"sagis@...gle.com" <sagis@...gle.com>,
	"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
	"Aktas, Erdem" <erdemaktas@...gle.com>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>,
	"sean.j.christopherson@...el.com" <sean.j.christopherson@...el.com>,
	isaku.yamahata@...ux.intel.com
Subject: Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages
 for unsupported cases

On Mon, Mar 18, 2024 at 11:46:11PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@...el.com> wrote:

> On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@...el.com wrote:
> > From: Sean Christopherson <sean.j.christopherson@...el.com>
> > 
> > TDX supports only write-back(WB) memory type for private memory
> > architecturally so that (virtualized) memory type change doesn't make
> > sense
> > for private memory.  Also currently, page migration isn't supported
> > for TDX
> > yet. (TDX architecturally supports page migration. it's KVM and
> > kernel
> > implementation issue.)
> > 
> > Regarding memory type change (mtrr virtualization and lapic page
> > mapping
> > change), pages are zapped by kvm_zap_gfn_range().  On the next KVM
> > page
> > fault, the SPTE entry with a new memory type for the page is
> > populated.
> > Regarding page migration, pages are zapped by the mmu notifier. On
> > the next
> > KVM page fault, the new migrated page is populated.  Don't zap
> > private
> > pages on unmapping for those two cases.
> 
> Is the migration case relevant to TDX?

We can forget about it because page migration isn't supported yet.


> > When deleting/moving a KVM memory slot, zap private pages. Typically
> > tearing down VM.  Don't invalidate private page tables. i.e. zap only
> > leaf
> > SPTEs for KVM mmu that has a shared bit mask. The existing
> > kvm_tdp_mmu_invalidate_all_roots() depends on role.invalid with read-
> > lock
> > of mmu_lock so that other vcpu can operate on KVM mmu concurrently. 
> > It
> > marks the root page table invalid and zaps SPTEs of the root page
> > tables. The TDX module doesn't allow to unlink a protected root page
> > table
> > from the hardware and then allocate a new one for it. i.e. replacing
> > a
> > protected root page table.  Instead, zap only leaf SPTEs for KVM mmu
> > with a
> > shared bit mask set.
> 
> I get the part about only zapping leafs and not the root and mid-level
> PTEs. But why the MTRR, lapic page and migration part? Why should those
> not be zapped? Why is migration a consideration when it is not
> supported?


When we zap a page from the guest and add it back on TDX, even at the same
GPA, the page is zeroed.  We'd like to keep the memory contents in those cases.

Ok, let me add those whys and drop the migration part. Here is the updated one.

TDX architecturally supports only the write-back (WB) memory type for
private memory, so a (virtualized) memory type change doesn't make sense
for private memory.  Also, when a private page is removed from the guest
and re-added at the same GPA, the page is zeroed.

For a memory type change (MTRR virtualization and LAPIC page mapping
change), the current implementation zaps the pages and repopulates them
with the new memory type on the next KVM page fault.  That doesn't work
for TDX because the repopulated private pages would be zeroed.  Since TDX
supports only WB anyway, ignore such requests for private memory and
don't zap private pages on unmapping in those two cases.

TDX Secure-EPT requires removing guest pages first, then leaf Secure-EPT
pages, in that order; it doesn't allow zapping a Secure-EPT entry that
still has child pages.  That conflicts with the current TDP MMU zapping
logic, which invalidates the root page table without touching the child
pages.  Instead, zap only leaf SPTEs for a KVM MMU that has a shared bit
mask.
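
To make the intent concrete, here is a rough sketch of the caller side
(based on the hunks below; the root-iteration macro and is_private_sp()
are taken from elsewhere in this series, so details may differ):

/*
 * Sketch only: zap leaf SPTEs in [start, end).  Private roots are skipped
 * unless the caller explicitly asks for them, so MTRR/APIC-triggered zaps
 * (zap_private = false) never end up zeroing private guest memory.
 */
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end,
			   bool flush, bool zap_private)
{
	struct kvm_mmu_page *root;

	lockdep_assert_held_write(&kvm->mmu_lock);
	for_each_tdp_mmu_root_yield_safe(kvm, root) {
		if (is_private_sp(root) && !zap_private)
			continue;
		flush = tdp_mmu_zap_leafs(kvm, root, start, end,
					  true, flush, zap_private);
	}

	return flush;
}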

> 
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@...el.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     | 61
> > ++++++++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++----
> >  arch/x86/kvm/mmu/tdp_mmu.h |  5 ++--
> >  3 files changed, 92 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 0d6d4506ec97..30c86e858ae4 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6339,7 +6339,7 @@ static void kvm_mmu_zap_all_fast(struct kvm
> > *kvm)
> >          * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock
> > and yield.
> >          */
> >         if (tdp_mmu_enabled)
> > -               kvm_tdp_mmu_invalidate_all_roots(kvm);
> > +               kvm_tdp_mmu_invalidate_all_roots(kvm, true);
> >  
> >         /*
> >          * Notify all vcpus to reload its shadow page table and flush
> > TLB.
> > @@ -6459,7 +6459,16 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> > gfn_start, gfn_t gfn_end)
> >         flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> >  
> >         if (tdp_mmu_enabled)
> > -               flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start,
> > gfn_end, flush);
> > +               /*
> > +                * zap_private = false. Zap only shared pages.
> > +                *
> > +                * kvm_zap_gfn_range() is used when MTRR or PAT
> > memory
> > +                * type was changed.  Later on the next kvm page
> > fault,
> > +                * populate it with updated spte entry.
> > +                * Because only WB is supported for private pages,
> > don't
> > +                * care of private pages.
> > +                */
> > +               flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start,
> > gfn_end, flush, false);
> >  
> >         if (flush)
> >                 kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end -
> > gfn_start);
> > @@ -6905,10 +6914,56 @@ void kvm_arch_flush_shadow_all(struct kvm
> > *kvm)
> >         kvm_mmu_zap_all(kvm);
> >  }
> >  
> > +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct
> > kvm_memory_slot *slot)
> 
> What about kvm_mmu_zap_memslot_leafs() instead?

Ok.


> > +{
> > +       bool flush = false;
> 
> It doesn't need to be initialized if it passes false directly into
> kvm_tdp_mmu_unmap_gfn_range(). It would make the code easier to
> understand.
> 
> > +
> > +       write_lock(&kvm->mmu_lock);
> > +
> > +       /*
> > +        * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't
> > required, worst
> > +        * case scenario we'll have unused shadow pages lying around
> > until they
> > +        * are recycled due to age or when the VM is destroyed.
> > +        */
> > +       if (tdp_mmu_enabled) {
> > +               struct kvm_gfn_range range = {
> > +                     .slot = slot,
> > +                     .start = slot->base_gfn,
> > +                     .end = slot->base_gfn + slot->npages,
> > +                     .may_block = true,
> > +
> > +                     /*
> > +                      * This handles both private gfn and shared
> > gfn.
> > +                      * All private page should be zapped on memslot
> > deletion.
> > +                      */
> > +                     .only_private = true,
> > +                     .only_shared = true,
> 
> only_private and only_shared are both true? Shouldn't they both be
> false? (or just unset)

I replied at:
https://lore.kernel.org/kvm/ZUO1Giju0GkUdF0o@google.com/

> 
> > +               };
> > +
> > +               flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range,
> > flush);
> > +       } else {
> > +               /* TDX supports only TDP-MMU case. */
> > +               WARN_ON_ONCE(1);
> 
> How about a KVM_BUG_ON() instead? If somehow this is reached, we don't
> want the caller thinking the pages are zapped, then enter the guest
> with pages mapped that have gone elsewhere.
> 
> > +               flush = true;
> 
> Why flush?

Those are only safeguards; TDX supports only the TDP MMU. Let me drop them.
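
For example, something along these lines (untested sketch that folds in the
rename and the KVM_BUG_ON() suggestion; the only_private/only_shared handling
from the hunk above is left out here):

static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm,
				      struct kvm_memory_slot *slot)
{
	struct kvm_gfn_range range = {
		.slot = slot,
		.start = slot->base_gfn,
		.end = slot->base_gfn + slot->npages,
		.may_block = true,
	};
	bool flush;

	/*
	 * TDX supports only the TDP MMU; don't let the caller think the
	 * pages were zapped if that assumption is ever broken.
	 */
	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
		return;

	write_lock(&kvm->mmu_lock);
	flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false);
	if (flush)
		kvm_flush_remote_tlbs(kvm);
	write_unlock(&kvm->mmu_lock);
}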


> > +       }
> > +       if (flush)
> > +               kvm_flush_remote_tlbs(kvm);
> > +
> > +       write_unlock(&kvm->mmu_lock);
> > +}
> > +
> >  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> >                                    struct kvm_memory_slot *slot)
> >  {
> > -       kvm_mmu_zap_all_fast(kvm);
> > +       if (kvm_gfn_shared_mask(kvm))
> 
> There seems to be an attempt to abstract away the existence of Secure-
> EPT in mmu.c, that is not fully successful. In this case the code
> checks kvm_gfn_shared_mask() to see if it needs to handle the zapping
> in a way specific needed by S-EPT. It ends up being a little confusing
> because the actual check is about whether there is a shared bit. It
> only works because only S-EPT is the only thing that has a
> kvm_gfn_shared_mask().
> 
> Doing something like (kvm->arch.vm_type == KVM_X86_TDX_VM) looks wrong,
> but is more honest about what we are getting up to here. I'm not sure
> though, what do you think?

Right, I attempted that and failed in the zapping case.  This is due to the
restriction that Secure-EPT pages must be removed starting from the leaves,
whereas the VMX case (also NPT, even SNP) heavily depends on zapping the root
entry as an optimization.

I can think of:
- Add a TDX check. Looks wrong.
- Use kvm_gfn_shared_mask(kvm). Confusing.
- Give this check another name like zap_from_leafs (or a better name?); see
  the sketch below.  The implementation is the same as kvm_gfn_shared_mask(),
  plus a comment.
  - Or we could add a boolean variable to struct kvm instead.
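
As a rough illustration of the renaming option (the name is just a
placeholder):

/*
 * Same implementation as kvm_gfn_shared_mask() today, but the name documents
 * the real constraint: Secure-EPT pages can't be unlinked while they still
 * have children, so memslot zapping has to stay at the leaf level.
 */
static inline bool kvm_zap_leafs_only(struct kvm *kvm)
{
	return kvm_gfn_shared_mask(kvm);
}

kvm_arch_flush_shadow_memslot() would then check kvm_zap_leafs_only(kvm)
instead of kvm_gfn_shared_mask(kvm) directly.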
  

> >  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index d47f0daf1b03..e7514a807134 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> >          * for zapping and thus puts the TDP MMU's reference to each
> > root, i.e.
> >          * ultimately frees all roots.
> >          */
> > -       kvm_tdp_mmu_invalidate_all_roots(kvm);
> > +       kvm_tdp_mmu_invalidate_all_roots(kvm, false);
> >         kvm_tdp_mmu_zap_invalidated_roots(kvm);
> >  
> >         WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
> > @@ -771,7 +771,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct
> > kvm_mmu_page *sp)
> >   * operation can cause a soft lockup.
> >   */
> >  static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page
> > *root,
> > -                             gfn_t start, gfn_t end, bool can_yield,
> > bool flush)
> > +                             gfn_t start, gfn_t end, bool can_yield,
> > bool flush,
> > +                             bool zap_private)
> >  {
> >         struct tdp_iter iter;
> >  
> > @@ -779,6 +780,10 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm,
> > struct kvm_mmu_page *root,
> >  
> >         lockdep_assert_held_write(&kvm->mmu_lock);
> >  
> > +       WARN_ON_ONCE(zap_private && !is_private_sp(root));
> 
> All the callers have zap_private as zap_private && is_private_sp(root).
> What badness is it trying to uncover?

I added it during debugging. Let me drop it.
-- 
Isaku Yamahata <isaku.yamahata@...el.com>
