Message-ID: <CAGtprH-Hb3B-sG_0ockS++bP==Zyn2f4dvWpwC73+ksVt7YqJg@mail.gmail.com>
Date: Thu, 3 Jul 2025 09:51:28 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, rick.p.edgecombe@...el.com, kai.huang@...el.com,
adrian.hunter@...el.com, reinette.chatre@...el.com, xiaoyao.li@...el.com,
tony.lindgren@...el.com, binbin.wu@...ux.intel.com, dmatlack@...gle.com,
isaku.yamahata@...el.com, ira.weiny@...el.com, michael.roth@....com,
david@...hat.com, ackerleytng@...gle.com, tabba@...gle.com,
chao.p.peng@...el.com
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
On Wed, Jul 2, 2025 at 11:29 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
>
> Rather than invoking kvm_gmem_populate(), let tdx_vcpu_init_mem_region()
> open-code populating the initial memory region into the mirror page table
> and adding the region to the S-EPT.
>
> Background
> ===
> Sean initially suggested that TDX populate the initial memory region in a
> 4-step way [1]. Paolo then refactored guest_memfd and introduced the
> kvm_gmem_populate() interface [2] to help TDX populate the init memory
> region.
>
> tdx_vcpu_init_mem_region
> guard(mutex)(&kvm->slots_lock)
> kvm_gmem_populate
> filemap_invalidate_lock(file->f_mapping)
> __kvm_gmem_get_pfn //1. get private PFN
> post_populate //tdx_gmem_post_populate
> get_user_pages_fast //2. get source page
> kvm_tdp_map_page //3. map private PFN to mirror root
> tdh_mem_page_add //4. add private PFN to S-EPT and copy
> source page to it.
>
> kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its filemap
> invalidation lock also helps ensure the private PFN remains valid while
> tdh_mem_page_add() is invoked in TDX's post_populate hook.
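>
> (Condensed view of that flow, with details elided; the point is that the
> filemap invalidation lock is held across the whole per-page loop,
> including the post_populate hook:)
>
> kvm_gmem_populate
>   filemap_invalidate_lock(file->f_mapping)
>   for each gfn in [start_gfn, start_gfn + npages)
>     __kvm_gmem_get_pfn     //1. get private PFN
>     post_populate          //tdx_gmem_post_populate: steps 2-4 above
>   filemap_invalidate_unlock(file->f_mapping)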
>
> Though TDX does not need the folio preparation code, kvm_gmem_populate()
> helps share common code between SEV-SNP and TDX.
>
> Problem
> ===
> (1)
> In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> invalidation lock to protect its preparedness tracking. Similarly,
> Ackerley's in-place conversion guest_memfd series also requires
> kvm_gmem_get_pfn() to acquire the filemap invalidation lock [5].
>
> kvm_gmem_get_pfn
> filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>
> However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> filemap invalidation lock.
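>
> (I.e., roughly the following recursion on the same lock:)
>
> kvm_gmem_populate
>   filemap_invalidate_lock                //taken exclusively
>   tdx_gmem_post_populate
>     kvm_tdp_map_page
>       ...
>         kvm_gmem_get_pfn
>           filemap_invalidate_lock_shared //deadlock: lock already held above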
>
> (2)
> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> - filemap invalidation lock --> mm->mmap_lock
>
> However, in future code, the shared filemap invalidation lock will be held
> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> - mm->mmap_lock --> filemap invalidation lock
>
> This creates an AB-BA deadlock issue.
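>
> (Schematically, the two paths establish opposite lock ordering:)
>
>   tdx_vcpu_init_mem_region()        kvm_gmem_fault_shared() (future)
>   ----                              ----
>   filemap invalidation lock         mm->mmap_lock
>     -> mm->mmap_lock                  -> filemap invalidation lock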
>
> These two issues should still be present in Michael Roth's code [7], [8].
>
> Proposal
> ===
> To prevent deadlock and the AB-BA issue, this patch enables TDX to populate
> the initial memory region independently of kvm_gmem_populate(). The revised
> sequence in tdx_vcpu_init_mem_region() is as follows:
>
> tdx_vcpu_init_mem_region
> guard(mutex)(&kvm->slots_lock)
> tdx_init_mem_populate
> get_user_pages_fast //1. get source page
> kvm_tdp_map_page //2. map private PFN to mirror root
> read_lock(&kvm->mmu_lock);
> kvm_tdp_mmu_gpa_is_mapped // 3. check if the gpa is mapped in the
> mirror root and return the mapped
> private PFN.
> tdh_mem_page_add //4. add private PFN to S-EPT and copy source
> page to it
> read_unlock(&kvm->mmu_lock);
>
> The original step 1 "get private PFN" is now integrated in the new step 3
> "check if the gpa is mapped in the mirror root and return the mapped
> private PFN".
>
> Under the protection of slots_lock, holding mmu_lock for read ensures that
> the private PFN added by tdh_mem_page_add() is the same one mapped in the
> mirror page table. Additionally, before the TD state becomes
> TD_STATE_RUNNABLE, the only permitted map level is 4KB, preventing any
> potential merging or splitting in the mirror root under the read mmu_lock.
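>
> A rough C sketch of the revised helper, following the sequence above
> (argument lists are abbreviated and error handling is trimmed; it assumes
> kvm_tdp_mmu_gpa_is_mapped() is extended to also hand back the mapped
> private PFN per step 3, and tdx_mem_page_add() stands in for a thin
> wrapper around tdh_mem_page_add()):
>
> static int tdx_init_mem_populate(struct kvm_vcpu *vcpu, gpa_t gpa,
>                                  void __user *src)
> {
>         struct kvm *kvm = vcpu->kvm;
>         u64 err = PFERR_WRITE_MASK;  /* plus the private-fault bit, elided */
>         struct page *src_page;
>         kvm_pfn_t pfn;
>         u8 level = PG_LEVEL_4K;
>         int ret;
>
>         /* 1. Pin the source page holding the initial contents. */
>         if (get_user_pages_fast((unsigned long)src, 1, 0, &src_page) != 1)
>                 return -ENOMEM;
>
>         /* 2. Fault the private PFN into the mirror root (4KB only). */
>         ret = kvm_tdp_map_page(vcpu, gpa, err, &level);
>         if (ret)
>                 goto out_put;
>
>         read_lock(&kvm->mmu_lock);
>
>         /* 3. Recheck the mapping under mmu_lock and get the private PFN. */
>         if (!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, &pfn)) {
>                 ret = -EIO;
>                 goto out_unlock;
>         }
>
>         /* 4. Add the PFN to the S-EPT, copying in the source page. */
>         ret = tdx_mem_page_add(kvm, gpa, pfn, src_page);
>
> out_unlock:
>         read_unlock(&kvm->mmu_lock);
> out_put:
>         put_page(src_page);
>         return ret;
> }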
>
> So, this approach should work for TDX. It still follows the spirit of
> Sean's suggestion [1], where mapping the private PFN into the mirror root
> and adding it to the S-EPT with the initial content from the source page
> are executed in separate steps.
>
> Discussions
> ===
> The introduction of kvm_gmem_populate() was intended to make it usable by
> both TDX and SEV-SNP [3], which is why Paolo provided the vendor hook
> post_populate for both.
>
> a) TDX keeps using kvm_gmem_populate().
>    Pros: - keeps the status quo
>          - shares common code between SEV-SNP and TDX, though TDX does not
>            need to prepare folios.
>    Cons: - we need to explore solutions to the locking issues, e.g. the
>            proposal at [11].
>          - the PFN is faulted in twice for each GFN:
>            once in __kvm_gmem_get_pfn(), and again in kvm_gmem_get_pfn().
>
> b) Michael suggested introducing some variant of
>    kvm_tdp_map_page()/kvm_mmu_do_page_fault() to avoid invoking
>    kvm_gmem_get_pfn() in the kvm_gmem_populate() path [10].
>    Pros: - TDX can still invoke kvm_gmem_populate() and share common code
>            between SEV-SNP and TDX.
>    Cons: - only TDX needs this variant.
>          - it can't fix the 2nd (AB-BA) lock issue.
>
> c) The change in this patch
>    Pros: greater flexibility; simplifies the implementation for both
>          SEV-SNP and TDX.
>    Cons: undermines the purpose of sharing common code;
>          kvm_gmem_populate() may end up being usable only by SEV-SNP.
>
> Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com [1]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-10-pbonzini@redhat.com [2]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-1-pbonzini@redhat.com [3]
> Link: https://lore.kernel.org/lkml/20241212063635.712877-4-michael.roth@amd.com [4]
> Link: https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com [5]
> Link: https://lore.kernel.org/all/20250430165655.605595-9-tabba@google.com [6]
> Link: https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2 [7]
> Link: https://lore.kernel.org/kvm/20250613005400.3694904-1-michael.roth@amd.com [8]
> Link: https://lore.kernel.org/lkml/20250613151939.z5ztzrtibr6xatql@amd.com [9]
> Link: https://lore.kernel.org/lkml/20250613180418.bo4vqveigxsq2ouu@amd.com [10]
> Link: https://lore.kernel.org/lkml/aErK25Oo5VJna40z@yzhao56-desk.sh.intel.com [11]
>
> Suggested-by: Vishal Annapurve <vannapurve@...gle.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> ---
Thanks Yan for revisiting the initial memory population for TDX VMs.
This implementation seems much cleaner to me.
Acked-by: Vishal Annapurve <vannapurve@...gle.com>
Tested-by: Vishal Annapurve <vannapurve@...gle.com>