Message-ID: <aHR3gJ7zaedTjitU@yzhao56-desk>
Date: Mon, 14 Jul 2025 11:20:32 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Michael Roth <michael.roth@....com>
CC: Sean Christopherson <seanjc@...gle.com>, <pbonzini@...hat.com>,
	<kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<rick.p.edgecombe@...el.com>, <kai.huang@...el.com>,
	<adrian.hunter@...el.com>, <reinette.chatre@...el.com>,
	<xiaoyao.li@...el.com>, <tony.lindgren@...el.com>,
	<binbin.wu@...ux.intel.com>, <dmatlack@...gle.com>,
	<isaku.yamahata@...el.com>, <ira.weiny@...el.com>, <vannapurve@...gle.com>,
	<david@...hat.com>, <ackerleytng@...gle.com>, <tabba@...gle.com>,
	<chao.p.peng@...el.com>
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from
 kvm_gmem_populate()

On Fri, Jul 11, 2025 at 10:17:19AM -0500, Michael Roth wrote:
> On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > On Thu, Jul 10, 2025 at 09:24:13AM -0700, Sean Christopherson wrote:
> > > On Wed, Jul 09, 2025, Michael Roth wrote:
> > > > On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > > > > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > > > > to open-code populating the initial memory region into the mirror page
> > > > > table and adding the region to the S-EPT.
> > > > > 
> > > > > Background
> > > > > ===
> > > > > Sean initially suggested that TDX populate the initial memory region in a
> > > > > 4-step way [1]. Paolo refactored guest_memfd and introduced the
> > > > > kvm_gmem_populate() interface [2] to help TDX populate the init memory region.
> > > 
> > > I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
> > > after all :-)
> > > 
> > > > > tdx_vcpu_init_mem_region
> > > > >     guard(mutex)(&kvm->slots_lock)
> > > > >     kvm_gmem_populate
> > > > >         filemap_invalidate_lock(file->f_mapping)
> > > > >             __kvm_gmem_get_pfn      //1. get private PFN
> > > > >             post_populate           //tdx_gmem_post_populate
> > > > >                 get_user_pages_fast //2. get source page
> > > > >                 kvm_tdp_map_page    //3. map private PFN to mirror root
> > > > >                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
> > > > >                                          source page to it.
> > > > > 
> > > > > kvm_gmem_populate() helps TDX "get private PFN" in step 1. Its filemap
> > > > > invalidation lock also helps ensure the private PFN remains valid when
> > > > > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > > > > 
> > > > > Though TDX does not need the folio preparation code, kvm_gmem_populate()
> > > > > helps share common code between SEV-SNP and TDX.
> > > > > 
> > > > > Problem
> > > > > ===
> > > > > (1)
> > > > > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > > > > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > > > > invalidation lock for protecting its preparedness tracking. Similarly, the
> > > > > in-place conversion version of the guest_memfd series by Ackerley also requires
> > > > > kvm_gmem_get_pfn() to acquire the filemap invalidation lock [5].
> > > > > 
> > > > > kvm_gmem_get_pfn
> > > > >     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > > > > 
> > > > > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > > > > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > > > > filemap invalidation lock.
> > > > 
> > > > Bringing the prior discussion over to here: it seems wrong that
> > > > kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> > > > chain, because:
> > > > 
> > > > 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
> > > >    tdx_gmem_post_populate(), but we are throwing it away to grab it
> > > >    again via kvm_gmem_get_pfn(), which is then creating these locking issues
> > > >    that we are trying to work around. If we could simply pass that PFN down
> > > >    to kvm_tdp_map_page() (or some variant), then we would not trigger any
> > > >    deadlocks in the first place.
> > > 
> > > Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.
> > > 
> > > > 2) kvm_gmem_populate() is intended for pre-boot population of guest
> > > >    memory, and allows the post_populate callback to handle setting
> > > >    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
> > > >    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
> > > >    setup of private memory. Having kvm_gmem_get_pfn() called as part of
> > > >    the kvm_gmem_populate() chain brings these 2 things into conflict with
> > > >    each other, and TDX seems to be relying on the fact that it doesn't
> > > >    implement a handler for kvm_arch_gmem_prepare().
> > > > 
> > > > I don't think this hurts anything in the current code, and I don't
> > > > personally see any issue with open-coding the population path if it doesn't
> > > > fit TDX very well, but there was some effort put into making
> > > > kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> > > > design of the interface itself, but instead just some inflexibility on the
> > > > KVM MMU mapping side, then it seems more robust to address the latter if
> > > > possible.
> > > > 
> > > > Would something like the below be reasonable? 
> > > 
> > > No, polluting the page fault paths is a non-starter for me.  TDX really shouldn't
> > > be synthesizing a page fault when it has the PFN in hand.  And some of the behavior
> > > that's desirable for pre-faults looks flat out wrong for TDX.  E.g. returning '0'
> > > on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
> > > 
> > > I would much rather special case this path, because it absolutely is a special
> > > snowflake.  This even eliminates several exports of low level helpers that frankly
> > > have no business being used by TDX, e.g. kvm_mmu_reload().
> > > 
> > > ---
> > >  arch/x86/kvm/mmu.h         |  2 +-
> > >  arch/x86/kvm/mmu/mmu.c     | 78 ++++++++++++++++++++++++++++++++++++--
> > >  arch/x86/kvm/mmu/tdp_mmu.c |  1 -
> > >  arch/x86/kvm/vmx/tdx.c     | 24 ++----------
> > >  4 files changed, 78 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > > index b4b6860ab971..9cd7a34333af 100644
> > > --- a/arch/x86/kvm/mmu.h
> > > +++ b/arch/x86/kvm/mmu.h
> > > @@ -258,7 +258,7 @@ extern bool tdp_mmu_enabled;
> > >  #endif
> > >  
> > >  bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> > > -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> > > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
> > >  
> > >  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> > >  {
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 6e838cb6c9e1..bc937f8ed5a0 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4900,7 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >  	return direct_page_fault(vcpu, fault);
> > >  }
> > >  
> > > -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > +static int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa,
> > > +				 u64 error_code, u8 *level)
> > >  {
> > >  	int r;
> > >  
> > > @@ -4942,7 +4943,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > >  		return -EIO;
> > >  	}
> > >  }
> > > -EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > >  
> > >  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > >  				    struct kvm_pre_fault_memory *range)
> > > @@ -4978,7 +4978,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > >  	 * Shadow paging uses GVA for kvm page fault, so restrict to
> > >  	 * two-dimensional paging.
> > >  	 */
> > > -	r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
> > > +	r = kvm_tdp_prefault_page(vcpu, range->gpa | direct_bits, error_code, &level);
> > >  	if (r < 0)
> > >  		return r;
> > >  
> > > @@ -4990,6 +4990,77 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > >  	return min(range->size, end - range->gpa);
> > >  }
> > >  
> > > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > > +{
> > > +	struct kvm_page_fault fault = {
> > > +		.addr = gfn_to_gpa(gfn),
> > > +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> > > +		.prefetch = true,
> > > +		.is_tdp = true,
> > > +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> > > +
> > > +		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> > > +		.req_level = PG_LEVEL_4K,
> > kvm_mmu_hugepage_adjust() will replace the PG_LEVEL_4K here with PG_LEVEL_2M,
> > because the private_max_mapping_level hook is only invoked in
> > kvm_mmu_faultin_pfn_gmem().
> > 
> > Updating lpage_info can fix it though.
> > 
> > > +		.goal_level = PG_LEVEL_4K,
> > > +		.is_private = true,
> > > +
> > > +		.gfn = gfn,
> > > +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> > > +		.pfn = pfn,
> > > +		.map_writable = true,
> > > +	};
> > > +	struct kvm *kvm = vcpu->kvm;
> > > +	int r;
> > > +
> > > +	lockdep_assert_held(&kvm->slots_lock);
> > > +
> > > +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> > > +		return -EIO;
> > > +
> > > +	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
> > > +		return -EPERM;
> > > +
> > > +	r = kvm_mmu_reload(vcpu);
> > > +	if (r)
> > > +		return r;
> > > +
> > > +	r = mmu_topup_memory_caches(vcpu, false);
> > > +	if (r)
> > > +		return r;
> > > +
> > > +	do {
> > > +		if (signal_pending(current))
> > > +			return -EINTR;
> > > +
> > > +		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> > > +			return -EIO;
> > > +
> > > +		cond_resched();
> > > +
> > > +		guard(read_lock)(&kvm->mmu_lock);
> > > +
> > > +		r = kvm_tdp_mmu_map(vcpu, &fault);
> > > +	} while (r == RET_PF_RETRY);
> > > +
> > > +	if (r != RET_PF_FIXED)
> > > +		return -EIO;
> > > +
> > > +	/*
> > > +	 * The caller is responsible for ensuring that no MMU invalidations can
> > > +	 * occur.  Sanity check that the mapping hasn't been zapped.
> > > +	 */
> > > +	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> > > +		cond_resched();
> > > +
> > > +		scoped_guard(read_lock, &kvm->mmu_lock) {
> > > +			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, fault.addr), kvm))
> > > +				return -EIO;
> > > +		}
> > > +	}
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
> > 
> > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > log:
> > 
> > Problem
> > ===
> > ...
> > (2)
> > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > - filemap invalidation lock --> mm->mmap_lock
> > 
> > However, in future code, the shared filemap invalidation lock will be held
> > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > - mm->mmap_lock --> filemap invalidation lock
> 
> I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).

For the AB-BA issue (see the rough lock-ordering sketch below),

kvm_gmem_fault_shared() can be invoked for

(AB1) the guest_memfd (supporting in-place conversion) used for the init memory
      region.

      This happens when the memory is converted from private to shared during the
      TD's runtime, i.e., after the init memory region population stage
      (though it's unusual to convert GFNs populated from the initial memory
      region to shared during the TD's runtime, it's still valid for guests to
      do so).

(AB2) the guest_memfd (supporting in-place conversion) which is not used for the
      init memory region.


get_user_pages_fast() can be invoked for
(BA1) the guest_memfd (supporting in-place conversion) used for the init memory
      region, when userspace brings a separate source buffer.

(BA2) the guest_memfd (not supporting in-place conversion) used for the init
      memory region.
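
To make the two orderings explicit, here is a rough sketch of the call paths as I
understand them from the above (not exact code; the shared-fault path is the
future kvm_gmem_fault_shared() from [6]):

init mem region population, i.e. (BA1)/(BA2):
    tdx_vcpu_init_mem_region
        kvm_gmem_populate
            filemap_invalidate_lock            // lock B
                post_populate
                    get_user_pages_fast
                        mm->mmap_lock          // lock A, slow-path fallback

shared fault on an in-place-conversion guest_memfd, i.e. (AB1)/(AB2):
    page fault on a mmap()ed gmem range        // mm->mmap_lock held, lock A
        kvm_gmem_fault_shared
            filemap_invalidate_lock_shared     // lock B

So one side takes B then A, while the other takes A then B.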

> There was some discussion during a previous guest_memfd upstream call
> (May/June?) about whether to continue using kvm_gmem_populate() (or the
> callback you hand it) to handle initializing memory contents before
> in-place encryption, versus just expecting that userspace will
> initialize the contents directly via mmap() prior to issuing any calls
> that trigger kvm_gmem_populate().
> 
> I was planning on enforcing that the 'src' parameter to
> kvm_gmem_populate() must be NULL for cases where
> KVM_MEMSLOT_SUPPORTS_GMEM_SHARED is set, or otherwise it will return
> -EINVAL, because:
> 
> 1) it avoids this awkward path you mentioned where kvm_gmem_fault_shared()
>    triggers during kvm_gmem_populate()
It still can't fully avoid it.
Forcing src to NULL removes (BA1) above; however, the AB-BA lock issue is
still present between (BA2) and (AB1)/(AB2).

Compared to using fine-grained annotation to further address this issue, I think
Sean's fix [1] is cleaner. Besides, Vishal's concern [2] that "userspace can
still bring a separate source buffer even with in-place conversion available"
is reasonable.

[1] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
[2] https://lore.kernel.org/all/CAGtprH9dCCxK=GwVZTUKCeERQGbYD78-t4xDzQprmwtGxDoZXw@mail.gmail.com/

> 2) it makes no sense to have to copy anything from 'src' when we
>    now support in-place update
> 
> For the SNP side, that will require a small API update for
> SNP_LAUNCH_UPDATE that mandates that the corresponding 'uaddr' argument is
> ignored/disallowed in favor of in-place initialization from userspace via
> mmap(). Not sure if TDX would need a similar API update.
> 
> Would that work on the TDX side as well?
I believe the TDX side will support it as well, but I prefer not to rely on it
to resolve the AB-BA issue.
