linux-kernel - Re: [PATCH 10/16] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

Message-ID: <c8fe14f6c3b4a7330c3dc26f82c679334cf70994.camel@intel.com>
Date: Thu, 16 May 2024 02:00:32 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>, "Yamahata, Isaku"
	<isaku.yamahata@...el.com>
CC: "dmatlack@...gle.com" <dmatlack@...gle.com>, "seanjc@...gle.com"
	<seanjc@...gle.com>, "sagis@...gle.com" <sagis@...gle.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Zhao, Yan Y"
	<yan.y.zhao@...el.com>, "Aktas, Erdem" <erdemaktas@...gle.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "isaku.yamahata@...il.com" <isaku.yamahata@...il.com>
Subject: Re: [PATCH 10/16] KVM: x86/tdp_mmu: Support TDX private mapping for
 TDP MMU

On Wed, 2024-05-15 at 18:48 -0700, Isaku Yamahata wrote:
> On Thu, May 16, 2024 at 12:52:32PM +1200,
> "Huang, Kai" <kai.huang@...el.com> wrote:
> 
> > On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > > From: Isaku Yamahata <isaku.yamahata@...el.com>
> > > 
> > > Allocate mirrored page table for the private page table and implement MMU
> > > hooks to operate on the private page table.
> > > 
> > > To handle a page fault on a private GPA, KVM walks the mirrored page table
> > > in unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
> > > changes from the mirrored page table to the private page table.
> > > 
> > >    private KVM page fault   |
> > >        |                    |
> > >        V                    |
> > >   private GPA               |     CPU protected EPTP
> > >        |                    |           |
> > >        V                    |           V
> > >   mirrored PT root          |     private PT root
> > >        |                    |           |
> > >        V                    |           V
> > >     mirrored PT --hook to propagate-->private PT
> > >        |                    |           |
> > >        \--------------------+------\    |
> > >                             |      |    |
> > >                             |      V    V
> > >                             |    private guest page
> > >                             |
> > >                             |
> > >       non-encrypted memory  |    encrypted memory
> > >                             |
> > > 
> > > PT:          page table
> > > Private PT:  the CPU uses it, but it is invisible to KVM. TDX module
> > >              manages this table to map private guest pages.
> > > Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it
> > >              to propagate PT change to the actual private PT.
> > > 
> > > SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter)
> > > can be modified atomically with mmu_lock held for read; however, the MMU
> > > hooks to the private page table are not atomic operations.
> > > 
> > > To address this, a special REMOVED_SPTE is introduced and the sequence
> > > below is used when mirrored SPTEs are updated atomically.
> > > 
> > > 1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
> > > 2. The successful updater of the mirrored SPTE in step 1 proceeds with the
> > >     following steps.
> > > 3. Invoke MMU hooks to modify private page table with the target value.
> > > 4. (a) On hook success, update mirrored SPTE to target value.
> > >     (b) On hook failure, restore mirrored SPTE to original value.
> > > 
> > > KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.
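
For anyone reading this in the archive, the four-step sequence above can be
sketched in userspace C11 as follows. This is only an illustration of the
freeze / propagate / commit-or-restore pattern, not the code in the patch:
FROZEN_SPTE, set_mirrored_spte(), the enum values and the two sample hooks
are hypothetical names, and the real code works on tdp_iter state under
mmu_lock rather than on a bare atomic.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sentinel standing in for REMOVED_SPTE. */
#define FROZEN_SPTE UINT64_C(0x5a0)

enum update_result { UPDATE_OK, UPDATE_RACED, UPDATE_HOOK_FAILED };

/* Hypothetical hook type: returns 0 on success, nonzero on failure. */
typedef int (*private_hook_t)(uint64_t new_spte);

/* Sample hooks, standing in for the non-atomic private page table update. */
static int hook_ok(uint64_t new_spte)   { (void)new_spte; return 0; }
static int hook_fail(uint64_t new_spte) { (void)new_spte; return -1; }

static enum update_result set_mirrored_spte(_Atomic uint64_t *sptep,
                                            uint64_t old_spte,
                                            uint64_t new_spte,
                                            private_hook_t hook)
{
	uint64_t expected = old_spte;

	/* A frozen entry is owned by another updater; never overwrite it. */
	if (old_spte == FROZEN_SPTE)
		return UPDATE_RACED;

	/* Step 1: atomically freeze the mirrored entry. */
	if (!atomic_compare_exchange_strong(sptep, &expected, FROZEN_SPTE))
		return UPDATE_RACED;	/* Step 2: only the winner proceeds. */

	/* Step 3: propagate the change with the non-atomic hook. */
	if (hook(new_spte)) {
		/* Step 4(b): restore the original value on failure. */
		atomic_store(sptep, old_spte);
		return UPDATE_HOOK_FAILED;
	}

	/* Step 4(a): publish the target value on success. */
	atomic_store(sptep, new_spte);
	return UPDATE_OK;
}
```

A caller that gets UPDATE_RACED would re-read the entry and retry, which is
how a thread that observes the frozen value backs off in this sketch.
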
> > > 
> > > This sequence also applies when SPTEs are atomically updated from
> > > non-present to present in order to prevent potential conflicts when
> > > multiple vCPUs attempt to set private SPTEs to a different page size
> > > simultaneously, though only the 4K page size is currently supported for
> > > the private page table.
> > > 
> > > 2M page support can be done in future patches.
> > > 
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
> > > Co-developed-by: Kai Huang <kai.huang@...el.com>
> > > Signed-off-by: Kai Huang <kai.huang@...el.com>
> > > Co-developed-by: Yan Zhao <yan.y.zhao@...el.com>
> > > Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> > > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> > > ---
> > > TDX MMU Part 1:
> > >   - Remove unnecessary gfn, access twist in
> > >     tdp_mmu_map_handle_target_level(). (Chao Gao)
> > >   - Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
> > >     tdp_mmu_alloc_sp()
> > >   - Update comment in set_private_spte_present() (Yan)
> > >   - Open code call to kvm_mmu_init_private_spt() (Yan)
> > >   - Add comments on TDX MMU hooks (Yan)
> > >   - Fix various whitespace alignment (Yan)
> > >   - Remove pointless warnings and conditionals in
> > >     handle_removed_private_spte() (Yan)
> > >   - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
> > >   - Remove incorrect comment in handle_changed_spte() (Yan)
> > >   - Remove unneeded kvm_pfn_to_refcounted_page() and
> > >     is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
> > >   - Do kvm_gfn_for_root() branchless (Rick)
> > >   - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
> > >   - Add comment for stripping shared bit for fault.gfn (Chao)
> > > 
> > > v19:
> > > - drop CONFIG_KVM_MMU_PRIVATE
> > > 
> > > v18:
> > > - Rename freezed => frozen
> > > 
> > > v14 -> v15:
> > > - Refined is_private condition check in kvm_tdp_mmu_map().
> > >    Add kvm_gfn_shared_mask() check.
> > > - catch up for struct kvm_range change
> > > ---
> > >   arch/x86/include/asm/kvm-x86-ops.h |   5 +
> > >   arch/x86/include/asm/kvm_host.h    |  25 +++
> > >   arch/x86/kvm/mmu/mmu.c             |  13 +-
> > >   arch/x86/kvm/mmu/mmu_internal.h    |  19 +-
> > >   arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
> > >   arch/x86/kvm/mmu/tdp_mmu.c         | 269 +++++++++++++++++++++++++----
> > >   arch/x86/kvm/mmu/tdp_mmu.h         |   2 +-
> > >   7 files changed, 293 insertions(+), 42 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h
> > > b/arch/x86/include/asm/kvm-x86-ops.h
> > > index 566d19b02483..d13cb4b8fce6 100644
> > > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > > @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> > >   KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> > >   KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > >   KVM_X86_OP(load_mmu_pgd)
> > > +KVM_X86_OP_OPTIONAL(link_private_spt)
> > > +KVM_X86_OP_OPTIONAL(free_private_spt)
> > > +KVM_X86_OP_OPTIONAL(set_private_spte)
> > > +KVM_X86_OP_OPTIONAL(remove_private_spte)
> > > +KVM_X86_OP_OPTIONAL(zap_private_spte)
> > >   KVM_X86_OP(has_wbinvd_exit)
> > >   KVM_X86_OP(get_l2_tsc_offset)
> > >   KVM_X86_OP(get_l2_tsc_multiplier)
> > > diff --git a/arch/x86/include/asm/kvm_host.h
> > > b/arch/x86/include/asm/kvm_host.h
> > > index d010ca5c7f44..20fa8fa58692 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -470,6 +470,7 @@ struct kvm_mmu {
> > >         int (*sync_spte)(struct kvm_vcpu *vcpu,
> > >                          struct kvm_mmu_page *sp, int i);
> > >         struct kvm_mmu_root_info root;
> > > +       hpa_t private_root_hpa;
> > 
> > Should we have
> > 
> >         struct kvm_mmu_root_info private_root;
> > 
> > instead?
> 
> Yes. And the private root allocation can be pushed down into TDP MMU.

Why?

> 
[snip]
> > > @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
> > >   void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > >   {
> > >         kvm_mmu_unload(vcpu);
> > > +       if (tdp_mmu_enabled) {
> > > +               read_lock(&vcpu->kvm->mmu_lock);
> > > +               mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu-
> > > >private_root_hpa,
> > > +                                  NULL);
> > > +               read_unlock(&vcpu->kvm->mmu_lock);
> > > +       }
> > 
> > Hmm.. I don't quite like this, but sorry I kinda forgot why we need to do
> > this here.
> > 
> > Could you elaborate?
> > 
> > Anyway, from common code's perspective, we need some clarification of
> > why it is designed to be done here.
> 
> This should be cleaned up.  It can be pushed down into
> kvm_tdp_mmu_alloc_root().
> 
> void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
>   allocate shared root
>   if (has_mirrored_pt)
>     allocate private root
> 

Huh? This is kvm_mmu_destroy()...

> 
> > >         free_mmu_pages(&vcpu->arch.root_mmu);
> > >         free_mmu_pages(&vcpu->arch.guest_mmu);
> > >         mmu_free_memory_caches(vcpu);
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h
> > > b/arch/x86/kvm/mmu/mmu_internal.h
> > > index 0f1a9d733d9e..3a7fe9261e23 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -6,6 +6,8 @@
> > >   #include <linux/kvm_host.h>
> > >   #include <asm/kvm_host.h>
> > > +#include "mmu.h"
> > > +
> > >   #ifdef CONFIG_KVM_PROVE_MMU
> > >   #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> > >   #else
> > > @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct
> > > kvm_vcpu *vcpu, struct kvm_m
> > >         sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu-
> > > >arch.mmu_private_spt_cache);
> > >   }
> > > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page
> > > *root,
> > > +                                    gfn_t gfn)
> > > +{
> > > +       gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
> > > +
> > > +       /* Set shared bit if not private */
> > > +       gfn_for_root |= -(gfn_t)!is_private_sp(root) &
> > > kvm_gfn_shared_mask(kvm);
> > > +       return gfn_for_root;
> > > +}
> > > +
> > >   static inline bool kvm_mmu_page_ad_need_write_protect(struct
> > > kvm_mmu_page *sp)
> > >   {
> > >         /*
> > > @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct
> > > kvm_vcpu *vcpu, gpa_t cr2_or_gp
> > >         int r;
> > >         if (vcpu->arch.mmu->root_role.direct) {
> > > -               fault.gfn = fault.addr >> PAGE_SHIFT;
> > > +               /*
> > > +                * Things like memslots don't understand the concept of a
> > > shared
> > > +                * bit. Strip it so that the GFN can be used like normal,
> > > and the
> > > +                * fault.addr can be used when the shared bit is needed.
> > > +                */
> > > +               fault.gfn = gpa_to_gfn(fault.addr) &
> > > ~kvm_gfn_shared_mask(vcpu->kvm);
> > >                 fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> > 
> > Again, I don't think it's necessary for fault.gfn to still have the shared
> > bit here?
> > 
> > This kinda usage is pretty much the reason I want to get rid of
> > kvm_gfn_shared_mask().
> 
> We are going to flags like has_mirrored_pt and we have root page table
> iterator
> with types specified.  I'll investigate how we can reduce (or eliminate)
> those helper functions.

Let's transition the abusers off and see what's left. I'm still waiting for an
explanation of why they are bad when used properly.
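
For reference, here is a minimal standalone sketch of what the branchless
kvm_gfn_for_root() quoted above computes. It assumes kvm_gfn_to_private()
simply clears the shared bit, and it takes the mask and the "root is private"
flag as plain parameters instead of reading them from struct kvm and the root
SP; gfn_for_root() and both parameter names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

static gfn_t gfn_for_root(gfn_t gfn, bool root_is_private, gfn_t shared_mask)
{
	/* Assumed behaviour of kvm_gfn_to_private(): clear the shared bit. */
	gfn_t gfn_for_root = gfn & ~shared_mask;

	/*
	 * !root_is_private is 0 or 1. Negating it as an unsigned gfn_t gives
	 * 0 or all-ones, so the AND selects either nothing or the shared
	 * mask without a branch.
	 */
	gfn_for_root |= -(gfn_t)!root_is_private & shared_mask;
	return gfn_for_root;
}
```

With a zero mask (a VM that has no shared bit) the function returns the GFN
unchanged for either kind of root, which is the property the callers rely on
in this sketch.
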


[snip]
> 
> > >         /* The level of the root page given to the iterator */
> > >         int root_level;
> > 
> > [...]
> > 
> > >         for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> > > @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct
> > > kvm_vcpu *vcpu,
> > >                 new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> > >         else
> > >                 wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter-
> > > >gfn,
> > > -                                        fault->pfn, iter->old_spte,
> > > fault->prefetch, true,
> > > -                                        fault->map_writable, &new_spte);
> > > +                                       fault->pfn, iter->old_spte, fault-
> > > >prefetch, true,
> > > +                                       fault->map_writable, &new_spte);
> > >         if (new_spte == iter->old_spte)
> > >                 ret = RET_PF_SPURIOUS;
> > > @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
> > > kvm_page_fault *fault)
> > >         struct kvm *kvm = vcpu->kvm;
> > >         struct tdp_iter iter;
> > >         struct kvm_mmu_page *sp;
> > > +       gfn_t raw_gfn;
> > > +       bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> > 
> > Ditto.  I wish we can have 'has_mirrored_private_pt'.
> 
> Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?

Why not helpers that wrap vm_type like:
https://lore.kernel.org/kvm/d4c96caffd2633a70a140861d91794cdb54c7655.camel@intel.com/