linux-kernel - Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem

Message-ID: <20250711151719.goee7eqti4xyhsqr@amd.com>
Date: Fri, 11 Jul 2025 10:17:19 -0500
From: Michael Roth <michael.roth@....com>
To: Yan Zhao <yan.y.zhao@...el.com>
CC: Sean Christopherson <seanjc@...gle.com>, <pbonzini@...hat.com>,
	<kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<rick.p.edgecombe@...el.com>, <kai.huang@...el.com>,
	<adrian.hunter@...el.com>, <reinette.chatre@...el.com>,
	<xiaoyao.li@...el.com>, <tony.lindgren@...el.com>,
	<binbin.wu@...ux.intel.com>, <dmatlack@...gle.com>,
	<isaku.yamahata@...el.com>, <ira.weiny@...el.com>, <vannapurve@...gle.com>,
	<david@...hat.com>, <ackerleytng@...gle.com>, <tabba@...gle.com>,
	<chao.p.peng@...el.com>
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from
 kvm_gmem_populate()

On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> On Thu, Jul 10, 2025 at 09:24:13AM -0700, Sean Christopherson wrote:
> > On Wed, Jul 09, 2025, Michael Roth wrote:
> > > On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > > > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > > > to use open code to populate the initial memory region into the mirror page
> > > > table, and add the region to S-EPT.
> > > > 
> > > > Background
> > > > ===
> > > > Sean initially suggested TDX to populate initial memory region in a 4-step
> > > > way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> > > > interface [2] to help TDX populate init memory region.
> > 
> > I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
> > after all :-)
> > 
> > > > tdx_vcpu_init_mem_region
> > > >     guard(mutex)(&kvm->slots_lock)
> > > >     kvm_gmem_populate
> > > >         filemap_invalidate_lock(file->f_mapping)
> > > >             __kvm_gmem_get_pfn      //1. get private PFN
> > > >             post_populate           //tdx_gmem_post_populate
> > > >                 get_user_pages_fast //2. get source page
> > > >                 kvm_tdp_map_page    //3. map private PFN to mirror root
> > > >                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
> > > >                                          source page to it.
> > > > 
> > > > kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> > > > invalidate lock also helps ensure the private PFN remains valid when
> > > > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > > > 
> > > > Though TDX does not need the folio preparation code, kvm_gmem_populate()
> > > > helps share common code between SEV-SNP and TDX.
> > > > 
> > > > Problem
> > > > ===
> > > > (1)
> > > > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > > > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > > > invalidation lock for protecting its preparedness tracking. Similarly, the
> > > > in-place conversion version of guest_memfd series by Ackerley also requires
> > > > kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
> > > > 
> > > > kvm_gmem_get_pfn
> > > >     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > > > 
> > > > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > > > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > > > filemap invalidation lock.
> > > 
> > > Bringing the prior discussion over to here: it seems wrong that
> > > kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> > > chain, because:
> > > 
> > > 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
> > >    tdx_gmem_post_populate(), but we are throwing it away to grab it
> > >    again via kvm_gmem_get_pfn(), which is then creating these locking issues
> > >    that we are trying to work around. If we could simply pass that PFN down
> > >    to kvm_tdp_map_page() (or some variant), then we would not trigger any
> > >    deadlocks in the first place.
> > 
> > Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.
> > 
> > > 2) kvm_gmem_populate() is intended for pre-boot population of guest
> > >    memory, and allows the post_populate callback to handle setting
> > >    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
> > >    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
> > >    setup of private memory. Having kvm_gmem_get_pfn() called as part of
> > >    kvm_gmem_populate() chain brings these 2 things into conflict with
> > >    each other, and TDX seems to be relying on the fact that it doesn't
> > >    implement a handler for kvm_arch_gmem_prepare(). 
> > > 
> > > I don't think this hurts anything in the current code, and I don't
> > > personally see any issue with open-coding the population path if it doesn't
> > > fit TDX very well, but there was some effort put into making
> > > kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> > > design of the interface itself, but instead just some inflexibility on the
> > > KVM MMU mapping side, then it seems more robust to address the latter if
> > > possible.
> > > 
> > > Would something like the below be reasonable? 
> > 
> > No, polluting the page fault paths is a non-starter for me.  TDX really shouldn't
> > be synthesizing a page fault when it has the PFN in hand.  And some of the behavior
> > that's desirable for pre-faults looks flat out wrong for TDX.  E.g. returning '0'
> > on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
> > 
> > I would much rather special case this path, because it absolutely is a special
> > snowflake.  This even eliminates several exports of low level helpers that frankly
> > have no business being used by TDX, e.g. kvm_mmu_reload().
> > 
> > ---
> >  arch/x86/kvm/mmu.h         |  2 +-
> >  arch/x86/kvm/mmu/mmu.c     | 78 ++++++++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/mmu/tdp_mmu.c |  1 -
> >  arch/x86/kvm/vmx/tdx.c     | 24 ++----------
> >  4 files changed, 78 insertions(+), 27 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index b4b6860ab971..9cd7a34333af 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -258,7 +258,7 @@ extern bool tdp_mmu_enabled;
> >  #endif
> >  
> >  bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> > -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
> >  
> >  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> >  {
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6e838cb6c9e1..bc937f8ed5a0 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4900,7 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  	return direct_page_fault(vcpu, fault);
> >  }
> >  
> > -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > +static int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa,
> > +				 u64 error_code, u8 *level)
> >  {
> >  	int r;
> >  
> > @@ -4942,7 +4943,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> >  		return -EIO;
> >  	}
> >  }
> > -EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> >  
> >  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >  				    struct kvm_pre_fault_memory *range)
> > @@ -4978,7 +4978,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >  	 * Shadow paging uses GVA for kvm page fault, so restrict to
> >  	 * two-dimensional paging.
> >  	 */
> > -	r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
> > +	r = kvm_tdp_prefault_page(vcpu, range->gpa | direct_bits, error_code, &level);
> >  	if (r < 0)
> >  		return r;
> >  
> > @@ -4990,6 +4990,77 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >  	return min(range->size, end - range->gpa);
> >  }
> >  
> > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > +	struct kvm_page_fault fault = {
> > +		.addr = gfn_to_gpa(gfn),
> > +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> > +		.prefetch = true,
> > +		.is_tdp = true,
> > +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> > +
> > +		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> > +		.req_level = PG_LEVEL_4K,
> kvm_mmu_hugepage_adjust() will replace the PG_LEVEL_4K here with PG_LEVEL_2M,
> because the private_max_mapping_level hook is only invoked in
> kvm_mmu_faultin_pfn_gmem().
> 
> Updating lpage_info can fix it though.
> 
> > +		.goal_level = PG_LEVEL_4K,
> > +		.is_private = true,
> > +
> > +		.gfn = gfn,
> > +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> > +		.pfn = pfn,
> > +		.map_writable = true,
> > +	};
> > +	struct kvm *kvm = vcpu->kvm;
> > +	int r;
> > +
> > +	lockdep_assert_held(&kvm->slots_lock);
> > +
> > +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> > +		return -EIO;
> > +
> > +	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
> > +		return -EPERM;
> > +
> > +	r = kvm_mmu_reload(vcpu);
> > +	if (r)
> > +		return r;
> > +
> > +	r = mmu_topup_memory_caches(vcpu, false);
> > +	if (r)
> > +		return r;
> > +
> > +	do {
> > +		if (signal_pending(current))
> > +			return -EINTR;
> > +
> > +		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> > +			return -EIO;
> > +
> > +		cond_resched();
> > +
> > +		guard(read_lock)(&kvm->mmu_lock);
> > +
> > +		r = kvm_tdp_mmu_map(vcpu, &fault);
> > +	} while (r == RET_PF_RETRY);
> > +
> > +	if (r != RET_PF_FIXED)
> > +		return -EIO;
> > +
> > +	/*
> > +	 * The caller is responsible for ensuring that no MMU invalidations can
> > +	 * occur.  Sanity check that the mapping hasn't been zapped.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> > +		cond_resched();
> > +
> > +		scoped_guard(read_lock, &kvm->mmu_lock) {
> > +			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, fault.addr), kvm))
> > +				return -EIO;
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
> 
> Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> log:
> 
> Problem
> ===
> ...
> (2)
> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> - filemap invalidation lock --> mm->mmap_lock
> 
> However, in future code, the shared filemap invalidation lock will be held
> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> - mm->mmap_lock --> filemap invalidation lock

I wouldn't expect kvm_gmem_fault_shared() to trigger for the
KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
There was some discussion during a previous guest_memfd upstream call
(May/June?) about whether to continue using kvm_gmem_populate() (or the
callback you hand it) to handle initializing memory contents before
in-place encryption, versus just expecting that userspace will
initialize the contents directly via mmap() prior to issuing any calls
that trigger kvm_gmem_populate().

I was planning on enforcing that the 'src' parameter to
kvm_gmem_populate() must be NULL for cases where
KVM_MEMSLOT_SUPPORTS_GMEM_SHARED is set, or otherwise it will return
-EINVAL, because:

1) it avoids this awkward path you mentioned where kvm_gmem_fault_shared()
   triggers during kvm_gmem_populate()
2) it makes no sense to have to copy anything from 'src' when we
   now support in-place update
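
As a rough illustration of that check, here is a standalone userspace
mock-up (not kernel code, untested against any tree). The flag value, the
struct, and the helper name are all hypothetical placeholders; only the
"'src' must be NULL, else -EINVAL" rule comes from the description above.

```c
#include <assert.h>	/* only needed for the usage checks shown below */
#include <errno.h>
#include <stddef.h>

/* Placeholder flag bit; the real name and value are still undecided. */
#define MOCK_MEMSLOT_SUPPORTS_GMEM_SHARED	(1u << 0)

/* Stand-in for the memslot; only the flags matter for this sketch. */
struct mock_memslot {
	unsigned int flags;
};

/*
 * Hypothetical argument check at the top of a populate path: reject a
 * non-NULL 'src' when the slot supports shared (in-place) gmem, since
 * userspace is expected to have initialized the contents via mmap().
 */
static int mock_gmem_populate_check_src(const struct mock_memslot *slot,
					const void *src)
{
	if ((slot->flags & MOCK_MEMSLOT_SUPPORTS_GMEM_SHARED) && src)
		return -EINVAL;

	return 0;
}
```

With a slot that has the placeholder flag set, passing any non-NULL 'src'
yields -EINVAL and a NULL 'src' yields 0; a slot without the flag accepts
either, which keeps the existing copy-from-'src' behavior intact.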

For the SNP side, that will require a small API update for
SNP_LAUNCH_UPDATE that mandates that the corresponding 'uaddr' argument is
ignored/disallowed in favor of in-place initialization from userspace via
mmap(). Not sure if TDX would need a similar API update.

Would that work on the TDX side as well?

Thanks,

Mike