linux-kernel - Re: [PATCH 10/16] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <12afae41-906c-4bb7-956a-d73734c68010@intel.com>
Date: Thu, 16 May 2024 12:52:32 +1200
From: "Huang, Kai" <kai.huang@...el.com>
To: Rick Edgecombe <rick.p.edgecombe@...el.com>, <pbonzini@...hat.com>,
	<seanjc@...gle.com>, <kvm@...r.kernel.org>
CC: <linux-kernel@...r.kernel.org>, <isaku.yamahata@...il.com>,
	<erdemaktas@...gle.com>, <sagis@...gle.com>, <yan.y.zhao@...el.com>,
	<dmatlack@...gle.com>
Subject: Re: [PATCH 10/16] KVM: x86/tdp_mmu: Support TDX private mapping for
 TDP MMU



On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@...el.com>
> 
> Allocate mirrored page table for the private page table and implement MMU
> hooks to operate on the private page table.
> 
> To handle page fault to a private GPA, KVM walks the mirrored page table in
> unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
> changes from the mirrored page table to private page table.
> 
>    private KVM page fault   |
>        |                    |
>        V                    |
>   private GPA               |     CPU protected EPTP
>        |                    |           |
>        V                    |           V
>   mirrored PT root          |     private PT root
>        |                    |           |
>        V                    |           V
>     mirrored PT --hook to propagate-->private PT
>        |                    |           |
>        \--------------------+------\    |
>                             |      |    |
>                             |      V    V
>                             |    private guest page
>                             |
>                             |
>       non-encrypted memory  |    encrypted memory
>                             |
> 
> PT:         page table
> Private PT: the CPU uses it, but it is invisible to KVM. TDX module manages
>              this table to map private guest pages.
> Mirrored PT:It is visible to KVM, but the CPU doesn't use it. KVM uses it
>              to propagate PT change to the actual private PT.
> 
> SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter)
> can be modified atomically with mmu_lock held for read, however, the MMU
> hooks to private page table are not atomical operations.
> 
> To address it, a special REMOVED_SPTE is introduced and below sequence is
> used when mirrored SPTEs are updated atomically.
> 
> 1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
> 2. The successful updater of the mirrored SPTE in step 1 proceeds with the
>     following steps.
> 3. Invoke MMU hooks to modify private page table with the target value.
> 4. (a) On hook succeeds, update mirrored SPTE to target value.
>     (b) On hook failure, restore mirrored SPTE to original value.
> 
> KVM TDP MMU ensures other threads will not overrite REMOVED_SPTE.
> 
> This sequence also applies when SPTEs are atomiclly updated from
> non-present to present in order to prevent potential conflicts when
> multiple vCPUs attempt to set private SPTEs to a different page size
> simultaneously, though 4K page size is only supported for private page
> table currently.
> 
> 2M page support can be done in future patches.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
> Co-developed-by: Kai Huang <kai.huang@...el.com>
> Signed-off-by: Kai Huang <kai.huang@...el.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@...el.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> ---
> TDX MMU Part 1:
>   - Remove unnecessary gfn, access twist in
>     tdp_mmu_map_handle_target_level(). (Chao Gao)
>   - Open code call to kvm_mmu_alloc_private_spt() instead oCf doing it in
>     tdp_mmu_alloc_sp()
>   - Update comment in set_private_spte_present() (Yan)
>   - Open code call to kvm_mmu_init_private_spt() (Yan)
>   - Add comments on TDX MMU hooks (Yan)
>   - Fix various whitespace alignment (Yan)
>   - Remove pointless warnings and conditionals in
>     handle_removed_private_spte() (Yan)
>   - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
>   - Remove incorrect comment in handle_changed_spte() (Yan)
>   - Remove unneeded kvm_pfn_to_refcounted_page() and
>     is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
>   - Do kvm_gfn_for_root() branchless (Rick)
>   - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
>   - Add comment for stripping shared bit for fault.gfn (Chao)
> 
> v19:
> - drop CONFIG_KVM_MMU_PRIVATE
> 
> v18:
> - Rename freezed => frozen
> 
> v14 -> v15:
> - Refined is_private condition check in kvm_tdp_mmu_map().
>    Add kvm_gfn_shared_mask() check.
> - catch up for struct kvm_range change
> ---
>   arch/x86/include/asm/kvm-x86-ops.h |   5 +
>   arch/x86/include/asm/kvm_host.h    |  25 +++
>   arch/x86/kvm/mmu/mmu.c             |  13 +-
>   arch/x86/kvm/mmu/mmu_internal.h    |  19 +-
>   arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
>   arch/x86/kvm/mmu/tdp_mmu.c         | 269 +++++++++++++++++++++++++----
>   arch/x86/kvm/mmu/tdp_mmu.h         |   2 +-
>   7 files changed, 293 insertions(+), 42 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 566d19b02483..d13cb4b8fce6 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
>   KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
>   KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
>   KVM_X86_OP(load_mmu_pgd)
> +KVM_X86_OP_OPTIONAL(link_private_spt)
> +KVM_X86_OP_OPTIONAL(free_private_spt)
> +KVM_X86_OP_OPTIONAL(set_private_spte)
> +KVM_X86_OP_OPTIONAL(remove_private_spte)
> +KVM_X86_OP_OPTIONAL(zap_private_spte)
>   KVM_X86_OP(has_wbinvd_exit)
>   KVM_X86_OP(get_l2_tsc_offset)
>   KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d010ca5c7f44..20fa8fa58692 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -470,6 +470,7 @@ struct kvm_mmu {
>   	int (*sync_spte)(struct kvm_vcpu *vcpu,
>   			 struct kvm_mmu_page *sp, int i);
>   	struct kvm_mmu_root_info root;
> +	hpa_t private_root_hpa;

Should we have

	struct kvm_mmu_root_info private_root;

instead?

>   	union kvm_cpu_role cpu_role;
>   	union kvm_mmu_page_role root_role;
>   
> @@ -1747,6 +1748,30 @@ struct kvm_x86_ops {
>   	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>   			     int root_level);
>   
> +	/* Add a page as page table page into private page table */
> +	int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				void *private_spt);
> +	/*
> +	 * Free a page table page of private page table.
> +	 * Only expected to be called when guest is not active, specifically
> +	 * during VM destruction phase.
> +	 */
> +	int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				void *private_spt);
> +
> +	/* Add a guest private page into private page table */
> +	int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				kvm_pfn_t pfn);
> +
> +	/* Remove a guest private page from private page table*/
> +	int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				   kvm_pfn_t pfn);
> +	/*
> +	 * Keep a guest private page mapped in private page table, but clear its
> +	 * present bit
> +	 */
> +	int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
> +
>   	bool (*has_wbinvd_exit)(void);
>   
>   	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 76f92cb37a96..2506d6277818 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>   	int r;
>   
>   	if (tdp_mmu_enabled) {
> -		kvm_tdp_mmu_alloc_root(vcpu);
> +		if (kvm_gfn_shared_mask(vcpu->kvm))
> +			kvm_tdp_mmu_alloc_root(vcpu, true);

As mentioned in replies to other patches, I kinda prefer

	kvm->arch.has_mirrored_pt (or has_mirrored_private_pt)

Or we have a helper

	kvm_has_mirrored_pt() / kvm_has_mirrored_private_pt()

> +		kvm_tdp_mmu_alloc_root(vcpu, false);
>   		return 0;
>   	}
>   
> @@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
>   		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
>   			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> -			gfn_t base = gfn_round_for_level(fault->gfn,
> +			gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
>   							 fault->max_level);

I thought by reaching here the shared bit has already been stripped away 
by the caller?

It doesn't make a lot sense to still have it here, given we have a 
universal KVM-defined PFERR_PRIVATE_ACCESS flag:

https://lore.kernel.org/kvm/20240507155817.3951344-2-pbonzini@redhat.com/T/#mb30987f31b431771b42dfa64dcaa2efbc10ada5e

IMHO we should just strip the shared bit in the TDX variant of 
handle_ept_violation(), and pass the PFERR_PRIVATE_ACCESS (when GPA 
doesn't hvae shared bit) to the common fault handler so it can correctly 
set fault->is_private to true.


>   
>   			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> @@ -6245,6 +6247,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>   
>   	mmu->root.hpa = INVALID_PAGE;
>   	mmu->root.pgd = 0;
> +	mmu->private_root_hpa = INVALID_PAGE;
>   	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>   		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
>   
> @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
>   void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
>   {
>   	kvm_mmu_unload(vcpu);
> +	if (tdp_mmu_enabled) {
> +		read_lock(&vcpu->kvm->mmu_lock);
> +		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> +				   NULL);
> +		read_unlock(&vcpu->kvm->mmu_lock);
> +	}

Hmm.. I don't quite like this, but sorry I kinda forgot why we need to 
to this here.

Could you elaborate?

Anyway, from common code's perspective, we need to have some 
clarification why we design to do it here.

>   	free_mmu_pages(&vcpu->arch.root_mmu);
>   	free_mmu_pages(&vcpu->arch.guest_mmu);
>   	mmu_free_memory_caches(vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 0f1a9d733d9e..3a7fe9261e23 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -6,6 +6,8 @@
>   #include <linux/kvm_host.h>
>   #include <asm/kvm_host.h>
>   
> +#include "mmu.h"
> +
>   #ifdef CONFIG_KVM_PROVE_MMU
>   #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
>   #else
> @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
>   	sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
>   }
>   
> +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +				     gfn_t gfn)
> +{
> +	gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
> +
> +	/* Set shared bit if not private */
> +	gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
> +	return gfn_for_root;
> +}
> +
>   static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
>   {
>   	/*
> @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
>   	int r;
>   
>   	if (vcpu->arch.mmu->root_role.direct) {
> -		fault.gfn = fault.addr >> PAGE_SHIFT;
> +		/*
> +		 * Things like memslots don't understand the concept of a shared
> +		 * bit. Strip it so that the GFN can be used like normal, and the
> +		 * fault.addr can be used when the shared bit is needed.
> +		 */
> +		fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
>   		fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);

Again, I don't think it's nessary for fault.gfn to still have the shared 
bit here?

This kinda usage is pretty much the reason I want to get rid of 
kvm_gfn_shared_mask().

>   	}
>   
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index fae559559a80..8a64bcef9deb 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -91,7 +91,7 @@ struct tdp_iter {
>   	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
>   	/* A pointer to the current SPTE */
>   	tdp_ptep_t sptep;
> -	/* The lowest GFN mapped by the current SPTE */
> +	/* The lowest GFN (shared bits included) mapped by the current SPTE */
>   	gfn_t gfn;

IMHO we need more clarification of this design.

We at least needs to call out the TDX hardware uses the 'GFA + shared 
bit' when it walks the page table for shared mappings, so we must set up 
the mapping at the GPA with the shared bit.

E.g, because TDX hardware uses separate root for shared/private 
mappings, I think it's a resonable opion for the TDX hardware to just 
use the actual GPA w/o shared bit when it walks the shared page table, 
and still report EPT violation with GPA with shared bit set.

Such HW implementation is completely hidden from software, thus should 
be clarified in the changelog/comments.


>   	/* The level of the root page given to the iterator */
>   	int root_level;

[...]

>   	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>   		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
>   	else
>   		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> -					 fault->pfn, iter->old_spte, fault->prefetch, true,
> -					 fault->map_writable, &new_spte);
> +					fault->pfn, iter->old_spte, fault->prefetch, true,
> +					fault->map_writable, &new_spte);
>   
>   	if (new_spte == iter->old_spte)
>   		ret = RET_PF_SPURIOUS;
> @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   	struct kvm *kvm = vcpu->kvm;
>   	struct tdp_iter iter;
>   	struct kvm_mmu_page *sp;
> +	gfn_t raw_gfn;
> +	bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);

Ditto.  I wish we can have 'has_mirrored_private_pt'.