Message-ID: <aErGKAHKA1VENLK0@yzhao56-desk.sh.intel.com>
Date: Thu, 12 Jun 2025 20:20:56 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Xiaoyao Li <xiaoyao.li@...el.com>, Paolo Bonzini <pbonzini@...hat.com>,
<rick.p.edgecombe@...el.com>, <kvm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <reinette.chatre@...el.com>,
<kai.huang@...el.com>, <adrian.hunter@...el.com>, <isaku.yamahata@...el.com>,
Binbin Wu <binbin.wu@...ux.intel.com>, <tony.lindgren@...ux.intel.com>
Subject: Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for
KVM_PRE_FAULT_MEMORY
On Wed, Jun 11, 2025 at 11:10:21AM -0700, Sean Christopherson wrote:
> On Tue, Jun 10, 2025, Xiaoyao Li wrote:
> > From: Paolo Bonzini <pbonzini@...hat.com>
> >
> > A bug[*] was reported for the TDX case when enabling KVM_PRE_FAULT_MEMORY in QEMU.
> >
> > It turns out that the @gpa passed to kvm_mmu_do_page_fault() doesn't have
> > the shared bit set when its memory attribute is shared, which leads to the
> > wrong root being picked in tdp_mmu_get_root_for_fault().
> >
> > Fix it by embedding the direct bits in the gpa that is passed to
> > kvm_tdp_map_page() when the memory at that gpa is not private.
> >
> > [*] https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
> >
> > Reported-by: Xiaoyao Li <xiaoyao.li@...el.com>
> > Closes: https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
> > Signed-off-by: Paolo Bonzini <pbonzini@...hat.com>
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@...el.com>
> > ---
> > We have selftest enhancements for the TDX case of KVM_PRE_FAULT_MEMORY, but
> > the plan is to post them on top of the TDX selftests [1] once those are
> > upstream.
> >
> > [1] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
> > ---
> > arch/x86/kvm/mmu/mmu.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index cbc84c6abc2e..a4040578b537 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4896,6 +4896,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > {
> > u64 error_code = PFERR_GUEST_FINAL_MASK;
> > u8 level = PG_LEVEL_4K;
> > + u64 direct_bits;
> > u64 end;
> > int r;
> >
> > @@ -4910,15 +4911,18 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > if (r)
> > return r;
> >
> > + direct_bits = 0;
> > if (kvm_arch_has_private_mem(vcpu->kvm) &&
> > kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
> > error_code |= PFERR_PRIVATE_ACCESS;
> > + else
> > + direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
>
> Eww. It's bad enough that TDX bleeds its mirror needs into common MMU code,
> but stuffing vendor specific GPA bits in common code goes too far. Actually,
> all of this goes too far. There's zero reason any code outside of TDX needs to
> *explicitly* care whether mirrors or "direct" MMUs have mandatory gfn bits.
>
> Side topic, kvm_arch_vcpu_pre_fault_memory() should disallow aliased GFNs. It
> will "work", because kvm_mmu_do_page_fault() will strip the SHARED bit, but it's
> all kinds of confusing due to KVM also forcing shared vs. private based on memory
> attributes.
>
> Back to the main topic, KVM needs to have a single source of truth when it comes
> to whether a fault is private and thus mirrored (or not). Common KVM needs to be
> aware of aliased GFN bits, but absolutely nothing outside of TDX (including common
> VMX code) should be aware that the mirror vs. "direct" (I hate that terminology; KVM
> has far, far too much history and baggage with "direct") is tied to the existence
> and polarity of aliased GFN bits.
>
> What we have now barely works *today* (see this bug), and it will be a complete
> trainwreck if we ever want to steal GFN bits for other reasons.
>
> To detect a mirror fault:
>
> static inline bool kvm_is_mirror_fault(struct kvm *kvm, u64 error_code)
> {
> return kvm_has_mirrored_tdp(kvm) &&
> error_code & PFERR_PRIVATE_ACCESS;
> }
>
> And for TDX, it should darn well explicitly track the shared GPA mask:
>
> static bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> {
> /* For TDX the direct mask is the shared mask. */
> return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
> }
>
> Overloading a field in kvm_arch and bleeding TDX details into common code isn't
> worth saving 8 bytes per VM.
>
> Outside of TDX, detecting mirrors, and anti-aliasing logic, the only use of
> kvm_gfn_direct_bits() is to constrain TDP MMU walks to the appropriate gfn range.
> And for that, we can simply use kvm_mmu_page.gfn, with a kvm_x86_ops hook to get
> the TDP MMU root GFN (root allocation is a slow path, the CALL+RET is a non-issue).
>
> Compile tested only, and obviously needs to be split into multiple patches.
>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 3 ++
> arch/x86/include/asm/kvm_host.h | 5 +-
> arch/x86/kvm/mmu.h | 21 +++++---
> arch/x86/kvm/mmu/mmu.c | 3 ++
> arch/x86/kvm/mmu/mmu_internal.h | 14 +-----
> arch/x86/kvm/mmu/tdp_iter.c | 4 +-
> arch/x86/kvm/mmu/tdp_iter.h | 12 ++---
> arch/x86/kvm/mmu/tdp_mmu.c | 13 ++---
> arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> arch/x86/kvm/vmx/common.h | 17 ++++---
> arch/x86/kvm/vmx/main.c | 11 +++++
> arch/x86/kvm/vmx/tdx.c | 78 ++++++++++++++++++------------
> arch/x86/kvm/vmx/tdx.h | 2 +
> arch/x86/kvm/vmx/vmx.c | 2 +-
> arch/x86/kvm/vmx/x86_ops.h | 1 +
> 15 files changed, 112 insertions(+), 76 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 8d50e3e0a19b..f0a958ba4823 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -145,6 +145,9 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL(get_untagged_addr)
> KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> +#ifdef CONFIG_X86_64
> +KVM_X86_OP_OPTIONAL_RET0(get_tdp_mmu_root_gfn);
> +#endif
> KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
> KVM_X86_OP_OPTIONAL(gmem_invalidate)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 330cdcbed1a6..3a5efde1ab9d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1578,7 +1578,7 @@ struct kvm_arch {
> #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> struct kvm_mmu_memory_cache split_desc_cache;
>
> - gfn_t gfn_direct_bits;
> + gfn_t aliased_gfn_bits;
>
> /*
> * Size of the CPU's dirty log buffer, i.e. VMX's PML buffer. A Zero
> @@ -1897,6 +1897,9 @@ struct kvm_x86_ops {
>
> gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> +#ifdef CONFIG_X86_64
> + gfn_t (*get_tdp_mmu_root_gfn)(struct kvm *kvm, bool is_mirror);
> +#endif
> int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..cef726e59f9b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -257,7 +257,7 @@ extern bool tdp_mmu_enabled;
> #define tdp_mmu_enabled false
> #endif
>
> -bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> +bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa, u64 error_code);
> int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
>
> static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> @@ -309,20 +309,25 @@ static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
> return kvm->arch.vm_type == KVM_X86_TDX_VM;
> }
>
> -static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
> +static inline bool kvm_is_mirror_fault(struct kvm *kvm, u64 error_code)
> {
> - return kvm->arch.gfn_direct_bits;
> + return kvm_has_mirrored_tdp(kvm) &&
> + error_code & PFERR_PRIVATE_ACCESS;
> }
What about passing in is_private instead?
static inline bool kvm_is_mirror_fault(struct kvm *kvm, bool is_private)
{
return kvm_has_mirrored_tdp(kvm) && is_private;
}
tdp_mmu_get_root_for_fault() and kvm_tdp_mmu_gpa_is_mapped() can then pass in
fault->is_private or is_private directly, leaving the parsing of error_code &
PFERR_PRIVATE_ACCESS to kvm_mmu_do_page_fault() only.
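(IIUC, kvm_mmu_do_page_fault() already derives the bool when it fills in the
fault struct, i.e. something like

	.is_private = err & PFERR_PRIVATE_ACCESS,

so the fault paths get it for free; only the non-fault callers would need to
pass it explicitly.)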
> -static inline bool kvm_is_addr_direct(struct kvm *kvm, gpa_t gpa)
> +static inline gfn_t kvm_aliased_gfn_bits(struct kvm *kvm)
> {
> - gpa_t gpa_direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(kvm));
> -
> - return !gpa_direct_bits || (gpa & gpa_direct_bits);
> + return kvm->arch.aliased_gfn_bits;
> }
>
> static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
> {
> - return gfn & kvm_gfn_direct_bits(kvm);
> + return gfn & kvm_aliased_gfn_bits(kvm);
> }
> +
> +static inline gpa_t kvm_get_unaliased_gpa(struct kvm *kvm, gpa_t gpa)
> +{
> + return gpa & ~gfn_to_gpa(kvm_aliased_gfn_bits(kvm));
> +}
> +
> #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cbc84c6abc2e..0228d49ac363 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4902,6 +4902,9 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> if (!vcpu->kvm->arch.pre_fault_allowed)
> return -EOPNOTSUPP;
>
> + if (kvm_is_gfn_alias(vcpu->kvm, gpa_to_gfn(range->gpa)))
> + return -EINVAL;
> +
> /*
> * reload is efficient when called repeatedly, so we can do it on
> * every iteration.
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index db8f33e4de62..7d2c53d2b0ca 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -175,18 +175,6 @@ static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_
> sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
> }
>
> -static inline gfn_t kvm_gfn_root_bits(const struct kvm *kvm, const struct kvm_mmu_page *root)
> -{
> - /*
> - * Since mirror SPs are used only for TDX, which maps private memory
> - * at its "natural" GFN, no mask needs to be applied to them - and, dually,
> - * we expect that the bits is only used for the shared PT.
> - */
> - if (is_mirror_sp(root))
> - return 0;
> - return kvm_gfn_direct_bits(kvm);
> -}
> -
> static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm *kvm,
> struct kvm_mmu_page *sp)
> {
> @@ -376,7 +364,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> * bit. Strip it so that the GFN can be used like normal, and the
> * fault.addr can be used when the shared bit is needed.
> */
> - fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_direct_bits(vcpu->kvm);
> + fault.gfn = gpa_to_gfn(fault.addr) & ~vcpu->kvm->arch.aliased_gfn_bits;
> fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> }
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index 9e17bfa80901..c36bfb920382 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -37,8 +37,10 @@ void tdp_iter_restart(struct tdp_iter *iter)
> * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
> */
> void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
> - int min_level, gfn_t next_last_level_gfn, gfn_t gfn_bits)
> + int min_level, gfn_t next_last_level_gfn, bool paranoid)
> {
> + gfn_t gfn_bits = paranoid ? 0 : root->gfn;
> +
> if (WARN_ON_ONCE(!root || (root->role.level < 1) ||
> (root->role.level > PT64_ROOT_MAX_LEVEL) ||
> (gfn_bits && next_last_level_gfn >= gfn_bits))) {
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index 364c5da6c499..5117d64952f7 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -120,13 +120,13 @@ struct tdp_iter {
> * Iterates over every SPTE mapping the GFN range [start, end) in a
> * preorder traversal.
> */
> -#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
> - for (tdp_iter_start(&iter, root, min_level, start, kvm_gfn_root_bits(kvm, root)); \
> - iter.valid && iter.gfn < end; \
> +#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
> + for (tdp_iter_start(&iter, root, min_level, start, false); \
> + iter.valid && iter.gfn < end; \
> tdp_iter_next(&iter))
>
> -#define for_each_tdp_pte_min_level_all(iter, root, min_level) \
> - for (tdp_iter_start(&iter, root, min_level, 0, 0); \
> +#define for_each_tdp_pte_min_level_paranoid(iter, root, min_level) \
> + for (tdp_iter_start(&iter, root, min_level, 0, true); \
> iter.valid && iter.gfn < tdp_mmu_max_gfn_exclusive(); \
> tdp_iter_next(&iter))
>
> @@ -136,7 +136,7 @@ struct tdp_iter {
> tdp_ptep_t spte_to_child_pt(u64 pte, int level);
>
> void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
> - int min_level, gfn_t next_last_level_gfn, gfn_t gfn_bits);
> + int min_level, gfn_t next_last_level_gfn, bool paranoid);
> void tdp_iter_next(struct tdp_iter *iter);
> void tdp_iter_restart(struct tdp_iter *iter);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7f3d7229b2c1..15daf4353ccc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -257,6 +257,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> int as_id = kvm_mmu_role_as_id(role);
> struct kvm *kvm = vcpu->kvm;
> struct kvm_mmu_page *root;
> + gfn_t gfn;
>
> if (mirror)
> role.is_mirror = true;
> @@ -291,7 +292,8 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> }
>
> root = tdp_mmu_alloc_sp(vcpu);
> - tdp_mmu_init_sp(root, NULL, 0, role);
> + gfn = kvm_x86_call(get_tdp_mmu_root_gfn)(vcpu->kvm, mirror);
> + tdp_mmu_init_sp(root, NULL, gfn, role);
>
> /*
> * TDP MMU roots are kept until they are explicitly invalidated, either
> @@ -860,7 +862,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> {
> struct tdp_iter iter;
>
> - for_each_tdp_pte_min_level_all(iter, root, zap_level) {
> + for_each_tdp_pte_min_level_paranoid(iter, root, zap_level) {
> retry:
> if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> continue;
> @@ -1934,12 +1936,11 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
> }
>
> -bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
> +bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa, u64 error_code)
bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa, bool is_private)
> {
> struct kvm *kvm = vcpu->kvm;
> - bool is_direct = kvm_is_addr_direct(kvm, gpa);
> - hpa_t root = is_direct ? vcpu->arch.mmu->root.hpa :
> - vcpu->arch.mmu->mirror_root_hpa;
> + hpa_t root = kvm_is_mirror_fault(kvm, error_code) ? vcpu->arch.mmu->mirror_root_hpa :
> + vcpu->arch.mmu->root.hpa;
hpa_t root = kvm_is_mirror_fault(kvm, is_private) ? vcpu->arch.mmu->mirror_root_hpa :
vcpu->arch.mmu->root.hpa;
> u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
> int leaf;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 52acf99d40a0..397309dfc73f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -48,7 +48,7 @@ static inline enum kvm_tdp_mmu_root_types kvm_gfn_range_filter_to_root_types(str
> static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> {
> - if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
> + if (unlikely(kvm_is_mirror_fault(vcpu->kvm, fault->error_code)))
So, here it can be
if (unlikely(kvm_is_mirror_fault(vcpu->kvm, fault->is_private)))
> return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
>
> return root_to_sp(vcpu->arch.mmu->root.hpa);
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index a0c5e8781c33..5065fd8d41e8 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -76,14 +76,9 @@ static __always_inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
>
> #endif
>
> -static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
> -{
> - /* For TDX the direct mask is the shared mask. */
> - return !kvm_is_addr_direct(kvm, gpa);
> -}
> -
> static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> - unsigned long exit_qualification)
> + unsigned long exit_qualification,
> + bool is_private)
> {
> u64 error_code;
>
> @@ -104,12 +99,18 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
> PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
>
> - if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
> + if (is_private)
> error_code |= PFERR_PRIVATE_ACCESS;
>
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> }
>
> +static inline int vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> + unsigned long exit_qualification)
> +{
> + return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, false);
> +}
> +
> static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
> int pi_vec)
> {
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d1e02e567b57..6e0652ed0d22 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -855,6 +855,14 @@ static void vt_setup_mce(struct kvm_vcpu *vcpu)
> vmx_setup_mce(vcpu);
> }
>
> +static gfn_t vt_get_tdp_mmu_root_gfn(struct kvm *kvm, bool is_mirror)
> +{
> + if (!is_td(kvm))
> + return 0;
> +
> + return tdx_get_tdp_mmu_root_gfn(kvm, is_mirror);
> +}
> +
> static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> if (!is_td(kvm))
> @@ -1041,6 +1049,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
> .get_untagged_addr = vmx_get_untagged_addr,
>
> +#ifdef CONFIG_X86_64
> + .get_tdp_mmu_root_gfn = vt_op_tdx_only(get_tdp_mmu_root_gfn),
> +#endif
> .mem_enc_ioctl = vt_op_tdx_only(mem_enc_ioctl),
> .vcpu_mem_enc_ioctl = vt_op_tdx_only(vcpu_mem_enc_ioctl),
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..1abf3c158cd5 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1138,6 +1138,12 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
> return __kvm_emulate_hypercall(vcpu, 0, complete_hypercall_exit);
> }
>
> +static bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> +{
> + /* For TDX the direct mask is the shared mask. */
> + return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
> +}
> +
> /*
> * Split into chunks and check interrupt pending between chunks. This allows
> * for timely injection of interrupts to prevent issues with guest lockup
> @@ -1192,9 +1198,9 @@ static void __tdx_map_gpa(struct vcpu_tdx *tdx)
> * vcpu->run->hypercall.ret, ensuring that it is zero to not break QEMU.
> */
> tdx->vcpu.run->hypercall.ret = 0;
> - tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
> + tdx->vcpu.run->hypercall.args[0] = gfn_to_gpa(kvm_get_unaliased_gpa(tdx->vcpu.kvm, gpa));
No need for the gfn_to_gpa() here, kvm_get_unaliased_gpa() already returns a gpa.
> tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
> - tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
> + tdx->vcpu.run->hypercall.args[2] = tdx_is_private_gpa(tdx->vcpu.kvm, gpa) ?
> KVM_MAP_GPA_RANGE_ENCRYPTED :
> KVM_MAP_GPA_RANGE_DECRYPTED;
> tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
> @@ -1222,8 +1228,8 @@ static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>
> if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
> !kvm_vcpu_is_legal_gpa(vcpu, gpa + size - 1) ||
> - (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
> - vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))) {
> + (tdx_is_private_gpa(vcpu->kvm, gpa) !=
> + tdx_is_private_gpa(vcpu->kvm, gpa + size - 1))) {
> ret = TDVMCALL_STATUS_INVALID_OPERAND;
> goto error;
> }
> @@ -1411,11 +1417,11 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
> * TDG.VP.VMCALL<MMIO> allows only shared GPA, it makes no sense to
> * do MMIO emulation for private GPA.
> */
> - if (vt_is_tdx_private_gpa(vcpu->kvm, gpa) ||
> - vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))
> + if (tdx_is_private_gpa(vcpu->kvm, gpa) ||
> + tdx_is_private_gpa(vcpu->kvm, gpa + size - 1))
> goto error;
>
> - gpa = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
> + gpa = kvm_get_unaliased_gpa(vcpu->kvm, gpa);
>
> if (write)
> r = tdx_mmio_write(vcpu, gpa, size, val);
> @@ -1480,12 +1486,17 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> return 1;
> }
>
> +gfn_t tdx_get_tdp_mmu_root_gfn(struct kvm *kvm, bool is_mirror)
> +{
> + return is_mirror ? 0 : gpa_to_gfn(to_kvm_tdx(kvm)->shared_gpa_mask);
> +}
> +
> void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> {
> u64 shared_bit = (pgd_level == 5) ? TDX_SHARED_BIT_PWL_5 :
> TDX_SHARED_BIT_PWL_4;
>
> - if (KVM_BUG_ON(shared_bit != kvm_gfn_direct_bits(vcpu->kvm), vcpu->kvm))
> + if (KVM_BUG_ON(tdx_is_private_gpa(vcpu->kvm, shared_bit), vcpu->kvm))
> return;
Should be
if (KVM_BUG_ON(tdx_is_private_gpa(vcpu->kvm, gfn_to_gpa(shared_bit)), vcpu->kvm))
or remove the gpa_to_gfn() from the definitions of TDX_SHARED_BIT_PWL_5 and
TDX_SHARED_BIT_PWL_4.
> td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> @@ -1837,10 +1848,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> {
> unsigned long exit_qual;
> gpa_t gpa = to_tdx(vcpu)->exit_gpa;
> - bool local_retry = false;
> + bool is_private = tdx_is_private_gpa(vcpu->kvm, gpa);
> int ret;
>
> - if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
> + if (is_private) {
> if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
> pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
> gpa, vcpu->vcpu_id);
> @@ -1857,9 +1868,6 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> * due to aliasing a single HPA to multiple GPAs.
> */
> exit_qual = EPT_VIOLATION_ACC_WRITE;
> -
> - /* Only private GPA triggers zero-step mitigation */
> - local_retry = true;
> } else {
> exit_qual = vmx_get_exit_qual(vcpu);
> /*
> @@ -1907,9 +1915,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> * handle retries locally in their EPT violation handlers.
> */
> while (1) {
> - ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> + ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual, is_private);
>
> - if (ret != RET_PF_RETRY || !local_retry)
> + /* Only private GPA triggers zero-step mitigation */
> + if (ret != RET_PF_RETRY || !is_private)
> break;
>
> if (kvm_vcpu_has_events(vcpu) || signal_pending(current))
> @@ -2620,12 +2629,11 @@ static int tdx_read_cpuid(struct kvm_vcpu *vcpu, u32 leaf, u32 sub_leaf,
> out->flags |= sub_leaf_set ? KVM_CPUID_FLAG_SIGNIFCANT_INDEX : 0;
>
> /*
> - * Work around missing support on old TDX modules, fetch
> - * guest maxpa from gfn_direct_bits.
> + * Work around missing support on old TDX modules, derive the guest
> + * MAXPA from the shared bit.
> */
> if (leaf == 0x80000008) {
> - gpa_t gpa_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
> - unsigned int g_maxpa = __ffs(gpa_bits) + 1;
> + unsigned int g_maxpa = __ffs(kvm_tdx->shared_gpa_mask) + 1;
>
> out->eax = tdx_set_guest_phys_addr_bits(out->eax, g_maxpa);
> }
> @@ -2648,6 +2656,7 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> struct kvm_tdx_init_vm *init_vm;
> struct td_params *td_params = NULL;
> + gpa_t shared_bit;
> int ret;
>
> BUILD_BUG_ON(sizeof(*init_vm) != 256 + sizeof_field(struct kvm_tdx_init_vm, cpuid));
> @@ -2712,9 +2721,12 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> kvm_tdx->xfam = td_params->xfam;
>
> if (td_params->config_flags & TDX_CONFIG_FLAGS_MAX_GPAW)
> - kvm->arch.gfn_direct_bits = TDX_SHARED_BIT_PWL_5;
> + shared_bit = TDX_SHARED_BIT_PWL_5;
> else
> - kvm->arch.gfn_direct_bits = TDX_SHARED_BIT_PWL_4;
> + shared_bit = TDX_SHARED_BIT_PWL_4;
> +
> + kvm_tdx->shared_gpa_mask = shared_bit;
> + kvm->arch.aliased_gfn_bits = gpa_to_gfn(shared_bit);
Should be
kvm_tdx->shared_gpa_mask = gfn_to_gpa(shared_bit);
kvm->arch.aliased_gfn_bits = shared_bit;
Or remove the gpa_to_gfn() from the definitions of TDX_SHARED_BIT_PWL_5 and
TDX_SHARED_BIT_PWL_4.
>
> kvm_tdx->state = TD_STATE_INITIALIZED;
> out:
> @@ -3052,6 +3064,16 @@ struct tdx_gmem_post_populate_arg {
> __u32 flags;
> };
>
> +static int tdx_is_private_mapping_valid(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> + if (!IS_ENABLED(CONFIG_KVM_PROVE_MMU))
> + return 0;
Should be "return 1".
> +
> + guard(read_lock)(&vcpu->kvm->mmu_lock);
> +
> + return kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, PFERR_PRIVATE_ACCESS);
So, here it can be
return kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, true);
> +}
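i.e., with the is_private variant of kvm_tdp_mmu_gpa_is_mapped() suggested
above, something like (untested):

static int tdx_is_private_mapping_valid(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	/* Treat the mapping as valid when the sanity check is compiled out. */
	if (!IS_ENABLED(CONFIG_KVM_PROVE_MMU))
		return 1;

	guard(read_lock)(&vcpu->kvm->mmu_lock);

	return kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, true);
}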
> +
> static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> void __user *src, int order, void *_arg)
> {
> @@ -3084,14 +3106,8 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> * because all paths are covered by slots_lock and the
> * filemap invalidate lock. Check that they are indeed enough.
> */
> - if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> - scoped_guard(read_lock, &kvm->mmu_lock) {
> - if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
> - ret = -EIO;
> - goto out;
> - }
> - }
> - }
> + if (KVM_BUG_ON(!tdx_is_private_mapping_valid(vcpu, gpa), kvm))
> + return -EIO;
>
> ret = 0;
> err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> @@ -3148,8 +3164,8 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
> if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
> !region.nr_pages ||
> region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
> - !vt_is_tdx_private_gpa(kvm, region.gpa) ||
> - !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
> + !tdx_is_private_gpa(kvm, region.gpa) ||
> + !tdx_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
> return -EINVAL;
>
> kvm_mmu_reload(vcpu);
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 51f98443e8a2..3807562d5b48 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -27,6 +27,8 @@ struct kvm_tdx {
> int hkid;
> enum kvm_tdx_state state;
>
> + gpa_t shared_gpa_mask;
> +
> u64 attributes;
> u64 xfam;
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9ff00ae9f05a..38103efe2f47 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5775,7 +5775,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
> return kvm_emulate_instruction(vcpu, 0);
>
> - return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> + return vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> }
>
> static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index b4596f651232..ee56e96b4db3 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -151,6 +151,7 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
>
> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>
> +gfn_t tdx_get_tdp_mmu_root_gfn(struct kvm *kvm, bool is_mirror);
> int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> enum pg_level level, void *private_spt);
> int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
>
> base-commit: 61374cc145f4a56377eaf87c7409a97ec7a34041
> --
>