Message-ID: <CAGtprH-Hb3B-sG_0ockS++bP==Zyn2f4dvWpwC73+ksVt7YqJg@mail.gmail.com>
Date: Thu, 3 Jul 2025 09:51:28 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, rick.p.edgecombe@...el.com, kai.huang@...el.com,
adrian.hunter@...el.com, reinette.chatre@...el.com, xiaoyao.li@...el.com,
tony.lindgren@...el.com, binbin.wu@...ux.intel.com, dmatlack@...gle.com,
isaku.yamahata@...el.com, ira.weiny@...el.com, michael.roth@....com,
david@...hat.com, ackerleytng@...gle.com, tabba@...gle.com,
chao.p.peng@...el.com
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
On Wed, Jul 2, 2025 at 11:29 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
>
> Rather than invoking kvm_gmem_populate(), let tdx_vcpu_init_mem_region()
> open-code populating the initial memory region into the mirror page table
> and adding the region to the S-EPT.
>
> Background
> ===
> Sean initially suggested that TDX populate the initial memory region in a
> 4-step way [1]. Paolo then refactored guest_memfd and introduced the
> kvm_gmem_populate() interface [2] to help TDX populate the init memory
> region.
>
> tdx_vcpu_init_mem_region
> guard(mutex)(&kvm->slots_lock)
> kvm_gmem_populate
> filemap_invalidate_lock(file->f_mapping)
> __kvm_gmem_get_pfn //1. get private PFN
> post_populate //tdx_gmem_post_populate
> get_user_pages_fast //2. get source page
> kvm_tdp_map_page //3. map private PFN to mirror root
> tdh_mem_page_add //4. add private PFN to S-EPT and copy
> source page to it.
>
> kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its filemap
> invalidation lock also helps ensure the private PFN remains valid while
> tdh_mem_page_add() is invoked in TDX's post_populate hook.
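>
> (Condensed view of that flow, with details elided; the point is that the
> filemap invalidation lock is held across the whole per-page loop,
> including the post_populate hook:)
>
> kvm_gmem_populate
>   filemap_invalidate_lock(file->f_mapping)
>   for each gfn in [start_gfn, start_gfn + npages)
>     __kvm_gmem_get_pfn     //1. get private PFN
>     post_populate          //tdx_gmem_post_populate: steps 2-4 above
>   filemap_invalidate_unlock(file->f_mapping)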
>
> Though TDX does not need the folio preparation code, kvm_gmem_populate()
> helps share common code between SEV-SNP and TDX.
>
> Problem
> ===
> (1)
> In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> invalidation lock to protect its preparedness tracking. Similarly,
> Ackerley's in-place conversion guest_memfd series also requires
> kvm_gmem_get_pfn() to acquire the filemap invalidation lock [5].
>
> kvm_gmem_get_pfn
> filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>
> However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> filemap invalidation lock.
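>
> (I.e., roughly the following recursion on the same lock:)
>
> kvm_gmem_populate
>   filemap_invalidate_lock                //taken exclusively
>   tdx_gmem_post_populate
>     kvm_tdp_map_page
>       ...
>         kvm_gmem_get_pfn
>           filemap_invalidate_lock_shared //deadlock: lock already held above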
>
> (2)
> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> - filemap invalidation lock --> mm->mmap_lock
>
> However, in future code, the shared filemap invalidation lock will be held
> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> - mm->mmap_lock --> filemap invalidation lock
>
> This creates an AB-BA deadlock issue.
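>
> (Schematically, the two paths establish opposite lock ordering:)
>
>   tdx_vcpu_init_mem_region()        kvm_gmem_fault_shared() (future)
>   ----                              ----
>   filemap invalidation lock         mm->mmap_lock
>     -> mm->mmap_lock                  -> filemap invalidation lock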
>
> These two issues should still be present in Michael Roth's code [7], [8].
>
> Proposal
> ===
> To prevent deadlock and the AB-BA issue, this patch enables TDX to populate
> the initial memory region independently of kvm_gmem_populate(). The revised
> sequence in tdx_vcpu_init_mem_region() is as follows:
>
> tdx_vcpu_init_mem_region
> guard(mutex)(&kvm->slots_lock)
> tdx_init_mem_populate
> get_user_pages_fast //1. get source page
> kvm_tdp_map_page //2. map private PFN to mirror root
> read_lock(&kvm->mmu_lock);
> kvm_tdp_mmu_gpa_is_mapped // 3. check if the gpa is mapped in the
> mirror root and return the mapped
> private PFN.
> tdh_mem_page_add //4. add private PFN to S-EPT and copy source
> page to it
> read_unlock(&kvm->mmu_lock);
>
> The original step 1 "get private PFN" is now integrated in the new step 3
> "check if the gpa is mapped in the mirror root and return the mapped
> private PFN".
>
> Under the protection of slots_lock, holding mmu_lock for read ensures that
> the private PFN added by tdh_mem_page_add() is the same one mapped in the
> mirror page table. Additionally, before the TD state becomes
> TD_STATE_RUNNABLE, the only permitted map level is 4KB, preventing any
> potential merging or splitting in the mirror root under the read mmu_lock.
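>
> A rough C sketch of the revised helper, following the sequence above
> (argument lists are abbreviated and error handling is trimmed; it assumes
> kvm_tdp_mmu_gpa_is_mapped() is extended to also hand back the mapped
> private PFN per step 3, and tdx_mem_page_add() stands in for a thin
> wrapper around tdh_mem_page_add()):
>
> static int tdx_init_mem_populate(struct kvm_vcpu *vcpu, gpa_t gpa,
>                                  void __user *src)
> {
>         struct kvm *kvm = vcpu->kvm;
>         u64 err = PFERR_WRITE_MASK;  /* plus the private-fault bit, elided */
>         struct page *src_page;
>         kvm_pfn_t pfn;
>         u8 level = PG_LEVEL_4K;
>         int ret;
>
>         /* 1. Pin the source page holding the initial contents. */
>         if (get_user_pages_fast((unsigned long)src, 1, 0, &src_page) != 1)
>                 return -ENOMEM;
>
>         /* 2. Fault the private PFN into the mirror root (4KB only). */
>         ret = kvm_tdp_map_page(vcpu, gpa, err, &level);
>         if (ret)
>                 goto out_put;
>
>         read_lock(&kvm->mmu_lock);
>
>         /* 3. Recheck the mapping under mmu_lock and get the private PFN. */
>         if (!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, &pfn)) {
>                 ret = -EIO;
>                 goto out_unlock;
>         }
>
>         /* 4. Add the PFN to the S-EPT, copying in the source page. */
>         ret = tdx_mem_page_add(kvm, gpa, pfn, src_page);
>
> out_unlock:
>         read_unlock(&kvm->mmu_lock);
> out_put:
>         put_page(src_page);
>         return ret;
> }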
>
> So, this approach should work for TDX. It still follows the spirit of
> Sean's suggestion [1], where mapping the private PFN into the mirror root
> and adding it to the S-EPT with the initial content from the source page
> are executed in separate steps.
>
> Discussions
> ===
> The introduction of kvm_gmem_populate() was intended to make it usable by
> both TDX and SEV-SNP [3], which is why Paolo provided the vendor hook
> post_populate for both.
>
> a) TDX keeps using kvm_gmem_populate().
>    Pros: - keeps the status quo
>          - shares common code between SEV-SNP and TDX, though TDX does not
>            need to prepare folios.
>    Cons: - we need to explore solutions to the locking issues, e.g. the
>            proposal at [11].
>          - the PFN is faulted in twice for each GFN:
>            once in __kvm_gmem_get_pfn(), and again in kvm_gmem_get_pfn().
>
> b) Michael suggested introducing some variant of
>    kvm_tdp_map_page()/kvm_mmu_do_page_fault() to avoid invoking
>    kvm_gmem_get_pfn() in the kvm_gmem_populate() path [10].
>    Pros: - TDX can still invoke kvm_gmem_populate() and share common code
>            between SEV-SNP and TDX.
>    Cons: - only TDX needs this variant.
>          - it can't fix the 2nd (AB-BA) lock issue.
>
> c) The change in this patch
>    Pros: greater flexibility; simplifies the implementation for both
>          SEV-SNP and TDX.
>    Cons: undermines the purpose of sharing common code;
>          kvm_gmem_populate() may end up being usable only by SEV-SNP.
>
> Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com [1]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-10-pbonzini@redhat.com [2]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-1-pbonzini@redhat.com [3]
> Link: https://lore.kernel.org/lkml/20241212063635.712877-4-michael.roth@amd.com [4]
> Link: https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com [5]
> Link: https://lore.kernel.org/all/20250430165655.605595-9-tabba@google.com [6]
> Link: https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2 [7]
> Link: https://lore.kernel.org/kvm/20250613005400.3694904-1-michael.roth@amd.com [8]
> Link: https://lore.kernel.org/lkml/20250613151939.z5ztzrtibr6xatql@amd.com [9]
> Link: https://lore.kernel.org/lkml/20250613180418.bo4vqveigxsq2ouu@amd.com [10]
> Link: https://lore.kernel.org/lkml/aErK25Oo5VJna40z@yzhao56-desk.sh.intel.com [11]
>
> Suggested-by: Vishal Annapurve <vannapurve@...gle.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> ---
Thanks Yan for revisiting the initial memory population for TDX VMs.
This implementation seems much cleaner to me.
Acked-by: Vishal Annapurve <vannapurve@...gle.com>
Tested-by: Vishal Annapurve <vannapurve@...gle.com>