linux-kernel - Re: [PATCH 3/5] KVM: gmem: Hold filemap invalidate lock while allocating/preparing folios

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250613180418.bo4vqveigxsq2ouu@amd.com>
Date: Fri, 13 Jun 2025 13:04:18 -0500
From: Michael Roth <michael.roth@....com>
To: Yan Zhao <yan.y.zhao@...el.com>
CC: Vishal Annapurve <vannapurve@...gle.com>, Ackerley Tng
	<ackerleytng@...gle.com>, <kvm@...r.kernel.org>,
	<linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <jroedel@...e.de>, <thomas.lendacky@....com>,
	<pbonzini@...hat.com>, <seanjc@...gle.com>, <vbabka@...e.cz>,
	<amit.shah@....com>, <pratikrajesh.sampat@....com>, <ashish.kalra@....com>,
	<liam.merwick@...cle.com>, <david@...hat.com>, <quic_eberman@...cinc.com>
Subject: Re: [PATCH 3/5] KVM: gmem: Hold filemap invalidate lock while
 allocating/preparing folios

On Thu, Jun 12, 2025 at 08:40:59PM +0800, Yan Zhao wrote:
> On Tue, Jun 03, 2025 at 11:28:35PM -0700, Vishal Annapurve wrote:
> > On Mon, Jun 2, 2025 at 6:34 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >
> > > On Mon, Jun 02, 2025 at 06:05:32PM -0700, Vishal Annapurve wrote:
> > > > On Tue, May 20, 2025 at 11:49 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > >
> > > > > On Mon, May 19, 2025 at 10:04:45AM -0700, Ackerley Tng wrote:
> > > > > > Ackerley Tng <ackerleytng@...gle.com> writes:
> > > > > >
> > > > > > > Yan Zhao <yan.y.zhao@...el.com> writes:
> > > > > > >
> > > > > > >> On Fri, Mar 14, 2025 at 05:20:21PM +0800, Yan Zhao wrote:
> > > > > > >>> This patch would cause host deadlock when booting up a TDX VM even if huge page
> > > > > > >>> is turned off. I currently reverted this patch. No further debug yet.
> > > > > > >> This is because kvm_gmem_populate() takes filemap invalidation lock, and for
> > > > > > >> TDX, kvm_gmem_populate() further invokes kvm_gmem_get_pfn(), causing deadlock.
> > > > > > >>
> > > > > > >> kvm_gmem_populate
> > > > > > >>   filemap_invalidate_lock
> > > > > > >>   post_populate
> > > > > > >>     tdx_gmem_post_populate
> > > > > > >>       kvm_tdp_map_page
> > > > > > >>        kvm_mmu_do_page_fault
> > > > > > >>          kvm_tdp_page_fault
> > > > > > >>       kvm_tdp_mmu_page_fault
> > > > > > >>         kvm_mmu_faultin_pfn
> > > > > > >>           __kvm_mmu_faultin_pfn
> > > > > > >>             kvm_mmu_faultin_pfn_private
> > > > > > >>               kvm_gmem_get_pfn
> > > > > > >>                 filemap_invalidate_lock_shared
> > > > > > >>
> > > > > > >> Though, kvm_gmem_populate() is able to take shared filemap invalidation lock,
> > > > > > >> (then no deadlock), lockdep would still warn "Possible unsafe locking scenario:
> > > > > > >> ...DEADLOCK" due to the recursive shared lock, since commit e918188611f0
> > > > > > >> ("locking: More accurate annotations for read_lock()").
> > > > > > >>
> > > > > > >
> > > > > > > Thank you for investigating. This should be fixed in the next revision.
> > > > > > >
> > > > > >
> > > > > > This was not fixed in v2 [1], I misunderstood this locking issue.
> > > > > >
> > > > > > IIUC kvm_gmem_populate() gets a pfn via __kvm_gmem_get_pfn(), then calls
> > > > > > part of the KVM fault handler to map the pfn into secure EPTs, then
> > > > > > calls the TDX module for the copy+encrypt.
> > > > > >
> > > > > > Regarding this lock, seems like KVM'S MMU lock is already held while TDX
> > > > > > does the copy+encrypt. Why must the filemap_invalidate_lock() also be
> > > > > > held throughout the process?
> > > > > If kvm_gmem_populate() does not hold filemap invalidate lock around all
> > > > > requested pages, what value should it return after kvm_gmem_punch_hole() zaps a
> > > > > mapping it just successfully installed?
> > > > >
> > > > > TDX currently only holds the read kvm->mmu_lock in tdx_gmem_post_populate() when
> > > > > CONFIG_KVM_PROVE_MMU is enabled, due to both slots_lock and the filemap
> > > > > invalidate lock being taken in kvm_gmem_populate().
> > > >
> > > > Does TDX need kvm_gmem_populate path just to ensure SEPT ranges are
> > > > not zapped during tdh_mem_page_add and tdh_mr_extend operations? Would
> > > > holding KVM MMU read lock during these operations sufficient to avoid
> > > > having to do this back and forth between TDX and gmem layers?
> > > I think the problem here is because in kvm_gmem_populate(),
> > > "__kvm_gmem_get_pfn(), post_populate(), and kvm_gmem_mark_prepared()"
> > > must be wrapped in filemap invalidate lock (shared or exclusive), right?
> > >
> > > Then, in TDX's post_populate() callback, the filemap invalidate lock is held
> > > again by kvm_tdp_map_page() --> ... ->kvm_gmem_get_pfn().
> > 
> > I am contesting the need of kvm_gmem_populate path altogether for TDX.
> > Can you help me understand what problem does kvm_gmem_populate path
> > help with for TDX?
> There is a long discussion on the list about this.
> 
> Basically TDX needs 3 steps for KVM_TDX_INIT_MEM_REGION.
> 1. Get the PFN
> 2. map the mirror page table
> 3. invoking tdh_mem_page_add().
> Holding filemap invalidation lock around the 3 steps helps ensure that the PFN
> passed to tdh_mem_page_add() is a valid one.

Since those requirements are already satisfied with kvm_gmem_populate(),
then maybe this issue is more with the fact that tdh_mem_page_add() is
making a separate call to kvm_gmem_get_pfn() even though the callback
has been handed a stable PFN that's protected with the filemap
invalidate lock.

Maybe some variant of kvm_tdp_map_page()/kvm_mmu_do_page_fault() that
can be handed the PFN and related fields up-front rather than grabbing
them later would be a more direct way to solve this? That would give us
more flexibility on the approaches I mentioned in my other response for
how to protect shareability state.

This also seems more correct in the sense that the current path triggers:

  tdx_gmem_post_populate
    kvm_tdp_mmu_page_fault
      kvm_gmem_get_pfn
        kvm_gmem_prepare_folio

even the kvm_gmem_populate() intentially avoids call kvm_gmem_get_pfn() in
favor of __kvm_gmem_get_pfn() specifically to avoid triggering the preparation
hooks, since kvm_gmem_populate() is a special case of preparation that needs
to be handled seperately/differently from the fault-time hooks.

This probably doesn't affect TDX because TDX doesn't make use of prepare
hooks, but since it's complicating things here it seems like we should address
it directly rather than work around it. Maybe it could even be floated as a
patch directly against kvm/next?

Thanks,

Mike

> 
> Rather then revisit it, what about fixing the contention more simply like this?
> Otherwise we can revisit the history.
> (The code is based on Ackerley's branch
> https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2, with patch "HACK: filemap_invalidate_lock() only for getting the pfn" reverted).
> 
> 
> commit d71956718d061926e5d91e5ecf60b58a0c3b2bad
> Author: Yan Zhao <yan.y.zhao@...el.com>
> Date:   Wed Jun 11 18:17:26 2025 +0800
> 
>     KVM: guest_memfd: Use shared filemap invalidate lock in kvm_gmem_populate()
> 
>     Convert kvm_gmem_populate() to use shared filemap invalidate lock. This is
>     to avoid deadlock caused by kvm_gmem_populate() further invoking
>     tdx_gmem_post_populate() which internally acquires shared filemap
>     invalidate lock in kvm_gmem_get_pfn().
> 
>     To avoid lockep warning by nested shared filemap invalidate lock,
>     avoid holding shared filemap invalidate lock in kvm_gmem_get_pfn() when
>     lockdep is enabled.
> 
>     Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 784fc1834c04..ccbb7ceb978a 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -2393,12 +2393,16 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>         struct file *file = kvm_gmem_get_file(slot);
>         struct folio *folio;
>         bool is_prepared = false;
> +       bool get_shared_lock;
>         int r = 0;
> 
>         if (!file)
>                 return -EFAULT;
> 
> -       filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> +       get_shared_lock = !IS_ENABLED(CONFIG_LOCKDEP) ||
> +                         !lockdep_is_held(&file_inode(file)->i_mapping->invalidate_lock);
> +       if (get_shared_lock)
> +               filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> 
>         folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
>         if (IS_ERR(folio)) {
> @@ -2423,7 +2427,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>         else
>                 folio_put(folio);
>  out:
> -       filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> +       if (get_shared_lock)
> +               filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>         fput(file);
>         return r;
>  }
> @@ -2536,7 +2541,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
>         if (!file)
>                 return -EFAULT;
> 
> -       filemap_invalidate_lock(file->f_mapping);
> +       filemap_invalidate_lock_shared(file->f_mapping);
> 
>         npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
>         for (i = 0; i < npages; i += npages_to_populate) {
> @@ -2587,7 +2592,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
>                         break;
>         }
> 
> -       filemap_invalidate_unlock(file->f_mapping);
> +       filemap_invalidate_unlock_shared(file->f_mapping);
> 
>         fput(file);
>         return ret && !i ? ret : i;
> 
> 
> If it looks good to you, then for the in-place conversion version of
> guest_memfd, there's one remaining issue left: an AB-BA lock issue between the
> shared filemap invalidate lock and mm->mmap_lock, i.e.,
> - In path kvm_gmem_fault_shared(),
>   the lock sequence is mm->mmap_lock --> filemap_invalidate_lock_shared(),
> - while in path kvm_gmem_populate(),
>   the lock sequence is filemap_invalidate_lock_shared() -->mm->mmap_lock.
> 
> We can fix it with below patch. The downside of the this patch is that it
> requires userspace to initialize all source pages passed to TDX, which I'm not
> sure if everyone likes it. If it cannot land, we still have another option:
> disallow the initial memory regions to be backed by the in-place conversion
> version of guest_memfd. If this can be enforced, then we can resolve the issue
> by annotating the lockdep, indicating that kvm_gmem_fault_shared() and
> kvm_gmem_populate() cannot occur on the same guest_memfd, so the two shared
> filemap invalidate locks in the two paths are not the same.
> 
> Author: Yan Zhao <yan.y.zhao@...el.com>
> Date:   Wed Jun 11 18:23:00 2025 +0800
> 
>     KVM: TDX: Use get_user_pages_fast_only() in tdx_gmem_post_populate()
> 
>     Convert get_user_pages_fast() to get_user_pages_fast_only()
>     in tdx_gmem_post_populate().
> 
>     Unlike get_user_pages_fast(), which will acquire mm->mmap_lock and fault in
>     physical pages after it finds the pages have not already faulted in or have
>     been zapped/swapped out, get_user_pages_fast_only() returns directly in
>     such cases.
> 
>     Using get_user_pages_fast_only() can avoid tdx_gmem_post_populate()
>     acquiring mm->mmap_lock, which may cause AB, BA lockdep warning with the
>     shared filemap invalidate lock when guest_memfd in-place conversion is
>     supported. (In path kvm_gmem_fault_shared(), the lock sequence is
>     mm->mmap_lock --> filemap_invalidate_lock_shared(), while in path
>     kvm_gmem_populate(), the lock sequence is filemap_invalidate_lock_shared()
>     -->mm->mmap_lock).
> 
>     Besides, using get_user_pages_fast_only() and returning directly to
>     userspace if a page is not present in the primary PTE can help detect a
>     careless case that the source pages are not initialized by userspace.
>     As initial memory region bypasses guest acceptance, copying an
>     uninitialized source page to guest could be harmful and undermine the page
>     measurement.
> 
>     Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 93c31eecfc60..462390dddf88 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3190,9 +3190,10 @@ static int tdx_gmem_post_populate_4k(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>          * Get the source page if it has been faulted in. Return failure if the
>          * source page has been swapped out or unmapped in primary memory.
>          */
> -       ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page);
> +       ret = get_user_pages_fast_only((unsigned long)src, 1, 0, &src_page);
>         if (ret < 0)
>                 return ret;
> +
>         if (ret != 1)
>                 return -ENOMEM;
>