Message-ID: <20250613151939.z5ztzrtibr6xatql@amd.com>
Date: Fri, 13 Jun 2025 10:19:39 -0500
From: Michael Roth <michael.roth@....com>
To: Yan Zhao <yan.y.zhao@...el.com>
CC: Vishal Annapurve <vannapurve@...gle.com>, Ackerley Tng
<ackerleytng@...gle.com>, <kvm@...r.kernel.org>,
<linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, <jroedel@...e.de>, <thomas.lendacky@....com>,
<pbonzini@...hat.com>, <seanjc@...gle.com>, <vbabka@...e.cz>,
<amit.shah@....com>, <pratikrajesh.sampat@....com>, <ashish.kalra@....com>,
<liam.merwick@...cle.com>, <david@...hat.com>, <quic_eberman@...cinc.com>
Subject: Re: [PATCH 3/5] KVM: gmem: Hold filemap invalidate lock while
allocating/preparing folios
On Thu, Jun 12, 2025 at 08:40:59PM +0800, Yan Zhao wrote:
> On Tue, Jun 03, 2025 at 11:28:35PM -0700, Vishal Annapurve wrote:
> > On Mon, Jun 2, 2025 at 6:34 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >
> > > On Mon, Jun 02, 2025 at 06:05:32PM -0700, Vishal Annapurve wrote:
> > > > On Tue, May 20, 2025 at 11:49 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > >
> > > > > On Mon, May 19, 2025 at 10:04:45AM -0700, Ackerley Tng wrote:
> > > > > > Ackerley Tng <ackerleytng@...gle.com> writes:
> > > > > >
> > > > > > > Yan Zhao <yan.y.zhao@...el.com> writes:
> > > > > > >
> > > > > > >> On Fri, Mar 14, 2025 at 05:20:21PM +0800, Yan Zhao wrote:
> > > > > > >>> This patch would cause a host deadlock when booting up a TDX VM even if huge
> > > > > > >>> pages are turned off. I have currently reverted this patch. No further
> > > > > > >>> debugging yet.
> > > > > > >> This is because kvm_gmem_populate() takes the filemap invalidate lock, and for
> > > > > > >> TDX, kvm_gmem_populate() further invokes kvm_gmem_get_pfn(), causing a deadlock:
> > > > > > >>
> > > > > > >> kvm_gmem_populate
> > > > > > >> filemap_invalidate_lock
> > > > > > >> post_populate
> > > > > > >> tdx_gmem_post_populate
> > > > > > >> kvm_tdp_map_page
> > > > > > >> kvm_mmu_do_page_fault
> > > > > > >> kvm_tdp_page_fault
> > > > > > >> kvm_tdp_mmu_page_fault
> > > > > > >> kvm_mmu_faultin_pfn
> > > > > > >> __kvm_mmu_faultin_pfn
> > > > > > >> kvm_mmu_faultin_pfn_private
> > > > > > >> kvm_gmem_get_pfn
> > > > > > >> filemap_invalidate_lock_shared
> > > > > > >>
> > > > > > >> Though kvm_gmem_populate() could take the shared filemap invalidate lock
> > > > > > >> instead (avoiding the deadlock), lockdep would still warn "Possible unsafe
> > > > > > >> locking scenario: ...DEADLOCK" due to the recursive shared lock, since commit
> > > > > > >> e918188611f0 ("locking: More accurate annotations for read_lock()").
> > > > > > >>
> > > > > > >
> > > > > > > Thank you for investigating. This should be fixed in the next revision.
> > > > > > >
> > > > > >
> > > > > > This was not fixed in v2 [1]; I misunderstood this locking issue.
> > > > > >
> > > > > > IIUC kvm_gmem_populate() gets a pfn via __kvm_gmem_get_pfn(), then calls
> > > > > > part of the KVM fault handler to map the pfn into secure EPTs, then
> > > > > > calls the TDX module for the copy+encrypt.
> > > > > >
> > > > > > > Regarding this lock, it seems like KVM's MMU lock is already held while TDX
> > > > > > does the copy+encrypt. Why must the filemap_invalidate_lock() also be
> > > > > > held throughout the process?
> > > > > If kvm_gmem_populate() does not hold the filemap invalidate lock around all
> > > > > requested pages, what value should it return after kvm_gmem_punch_hole() zaps a
> > > > > mapping it just successfully installed?
> > > > >
> > > > > TDX currently holds kvm->mmu_lock for read in tdx_gmem_post_populate() only
> > > > > when CONFIG_KVM_PROVE_MMU is enabled, since both slots_lock and the filemap
> > > > > invalidate lock are already taken in kvm_gmem_populate().
> > > >
> > > > Does TDX need the kvm_gmem_populate path just to ensure SEPT ranges are
> > > > not zapped during tdh_mem_page_add and tdh_mr_extend operations? Would
> > > > holding the KVM MMU read lock during these operations be sufficient to avoid
> > > > having to do this back and forth between the TDX and gmem layers?
> > > I think the problem here is that in kvm_gmem_populate(),
> > > "__kvm_gmem_get_pfn(), post_populate(), and kvm_gmem_mark_prepared()"
> > > must be wrapped in the filemap invalidate lock (shared or exclusive), right?
> > >
> > > Then, in TDX's post_populate() callback, the filemap invalidate lock is taken
> > > again by kvm_tdp_map_page() --> ... --> kvm_gmem_get_pfn().
> >
> > I am contesting the need for the kvm_gmem_populate path altogether for TDX.
> > Can you help me understand what problem the kvm_gmem_populate path
> > solves for TDX?
> There is a long discussion on the list about this.
>
> Basically TDX needs 3 steps for KVM_TDX_INIT_MEM_REGION:
> 1. Get the PFN.
> 2. Map the mirror page table.
> 3. Invoke tdh_mem_page_add().
> Holding the filemap invalidate lock around the 3 steps ensures that the PFN
> passed to tdh_mem_page_add() remains valid, as sketched below.
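>
> In rough pseudo-code (illustrative only, not the exact upstream code):
>
>	filemap_invalidate_lock(file->f_mapping);
>	/* 1. get the PFN */
>	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
>	/* 2 + 3. map the mirror page table, then tdh_mem_page_add() */
>	ret = post_populate(kvm, gfn, pfn, src, opaque);
>	filemap_invalidate_unlock(file->f_mapping);
>
> If the lock were dropped in between, a concurrent kvm_gmem_punch_hole()
> could zap the mapping and free the PFN before tdh_mem_page_add()
> consumes it.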
>
> Rather than revisit it, what about fixing the contention more simply like this?
> Otherwise we can revisit the history.
> (The code is based on Ackerley's branch
> https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2, with patch "HACK: filemap_invalidate_lock() only for getting the pfn" reverted).
>
>
> commit d71956718d061926e5d91e5ecf60b58a0c3b2bad
> Author: Yan Zhao <yan.y.zhao@...el.com>
> Date: Wed Jun 11 18:17:26 2025 +0800
>
> KVM: guest_memfd: Use shared filemap invalidate lock in kvm_gmem_populate()
>
> Convert kvm_gmem_populate() to use the shared filemap invalidate lock. This
> avoids a deadlock caused by kvm_gmem_populate() further invoking
> tdx_gmem_post_populate(), which internally acquires the shared filemap
> invalidate lock in kvm_gmem_get_pfn().
>
> To avoid a lockdep warning from the nested shared filemap invalidate lock,
> skip taking the shared filemap invalidate lock in kvm_gmem_get_pfn() when
> lockdep reports it as already held.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 784fc1834c04..ccbb7ceb978a 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -2393,12 +2393,16 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> struct file *file = kvm_gmem_get_file(slot);
> struct folio *folio;
> bool is_prepared = false;
> + bool get_shared_lock;
> int r = 0;
>
> if (!file)
> return -EFAULT;
>
> - filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> + get_shared_lock = !IS_ENABLED(CONFIG_LOCKDEP) ||
> + !lockdep_is_held(&file_inode(file)->i_mapping->invalidate_lock);
> + if (get_shared_lock)
> + filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
Hi Yan,
I had been working on a locking scheme that could account for some
potential[1] changes needed to allow concurrent updating of "preparedness"
state while still allowing for concurrent fault handling. I posted a tree
in that link with an alternative scheme that's based on an rw_semaphore,
like the filemap invalidate lock, but with some changes that allow the folio
lock to be taken to handle write-side updates to "preparedness" state
instead of needing to take the write-lock.
With that approach (or something similar), it is then possible to drop the
reliance on the filemap invalidate lock in kvm_gmem_get_pfn(), which I
think would cleanly resolve this particular issue.
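
As a rough illustration of the idea (hand-wavy; the semaphore and helper
names below are placeholders, not the actual code in that tree):

	/*
	 * Fault path: the invalidation rw_semaphore is held shared; the
	 * folio lock serializes the write-side "preparedness" update
	 * instead of requiring the semaphore's write-lock.
	 */
	down_read(&gmem_invalidate_sem);
	folio = kvm_gmem_get_folio(file, index);
	if (!folio_test_uptodate(folio)) {
		folio_lock(folio);
		/* re-check now that the folio lock is held */
		if (!folio_test_uptodate(folio))
			r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
		folio_unlock(folio);
	}
	up_read(&gmem_invalidate_sem);
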
However, it was also suggested during the guest_memfd call that we revisit
the need to track preparedness in guest_memfd at all, which resulted in me
posting this RFC[2] that removes preparedness tracking from gmem
completely. That series is, however, based on Ackerley's locking scheme from
his HugeTLB series, which reuses the filemap invalidate rw_semaphore
to protect the shareability state, so you'd hit similar issues with
kvm_gmem_populate().
However, as above (and even more easily so, since we don't need to do
anything fancy for concurrent "preparedness" updates), it would be
fairly trivial to replace the use of the filemap invalidate lock with an
rw_semaphore that's dedicated to protecting shareability state, which
should make it possible to drop the use of
filemap_invalidate_lock[_shared]() in kvm_gmem_get_pfn().
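
For example, a strawman (the struct and field names are hypothetical):

	/*
	 * Hypothetical per-gmem lock dedicated to shareability state, so
	 * kvm_gmem_get_pfn() no longer nests inside the filemap invalidate
	 * lock taken by kvm_gmem_populate().
	 */
	struct kvm_gmem_inode_private {
		struct rw_semaphore shareability_lock;
		/* ... shareability tracking ... */
	};

kvm_gmem_get_pfn() would then take shareability_lock for read in place of
filemap_invalidate_lock_shared().
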
But your above patch seems like it would at least get things working in
the meantime if there's still some discussion that needs to happen
before we can make a good call on:
1) whether to continue to use the filemap invalidate lock or use a dedicated
   one
   (my 2 cents: use a dedicated lock so we don't have to deal with
   inheriting unintended/unnecessary locking dependencies)
2) whether or not it will be acceptable to drop preparedness-tracking
   from guest_memfd
   (my 2 cents: it will make all our lives much happier)
3) whether to open-code what kvm_gmem_populate() handles currently if we
   need extra flexibility WRT locking
   (my 2 cents: if it can be avoided, it's still nice to have gmem
   handle/orchestrate this to some degree)
Thanks,
Mike
[1] https://lore.kernel.org/lkml/20250529054227.hh2f4jmyqf6igd3i@amd.com/
[2] https://lore.kernel.org/kvm/20250613005400.3694904-1-michael.roth@amd.com/
>
> folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> if (IS_ERR(folio)) {
> @@ -2423,7 +2427,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> else
> folio_put(folio);
> out:
> - filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> + if (get_shared_lock)
> + filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> fput(file);
> return r;
> }
> @@ -2536,7 +2541,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> if (!file)
> return -EFAULT;
>
> - filemap_invalidate_lock(file->f_mapping);
> + filemap_invalidate_lock_shared(file->f_mapping);
>
> npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
> for (i = 0; i < npages; i += npages_to_populate) {
> @@ -2587,7 +2592,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> break;
> }
>
> - filemap_invalidate_unlock(file->f_mapping);
> + filemap_invalidate_unlock_shared(file->f_mapping);
>
> fput(file);
> return ret && !i ? ret : i;
>
>
> If it looks good to you, then for the in-place conversion version of
> guest_memfd, there's one remaining issue: an AB-BA lock inversion between the
> shared filemap invalidate lock and mm->mmap_lock, i.e.,
> - in path kvm_gmem_fault_shared(),
>   the lock sequence is mm->mmap_lock --> filemap_invalidate_lock_shared(),
> - while in path kvm_gmem_populate(),
>   the lock sequence is filemap_invalidate_lock_shared() --> mm->mmap_lock.
>
> We can fix it with the patch at the end of this mail. The downside of this
> patch is that it requires userspace to initialize all source pages passed to
> TDX, which I'm not sure everyone likes. If it cannot land, we still have
> another option: disallow the initial memory regions from being backed by the
> in-place conversion version of guest_memfd. If this can be enforced, then we
> can resolve the issue with a lockdep annotation indicating that
> kvm_gmem_fault_shared() and kvm_gmem_populate() cannot occur on the same
> guest_memfd, so the two shared filemap invalidate locks in the two paths are
> not the same lock. A rough sketch of that annotation follows below, before
> the actual patch.
>
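> Illustrative only (the key name and the flag check are hypothetical):
>
>	/*
>	 * Put the invalidate_lock of a conversion-capable guest_memfd into
>	 * its own lockdep class, so lockdep does not conflate it with the
>	 * lock taken in kvm_gmem_populate() on a non-convertible guest_memfd.
>	 */
>	static struct lock_class_key gmem_convertible_invalidate_key;
>
>	if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)	/* hypothetical flag */
>		lockdep_set_class(&inode->i_mapping->invalidate_lock,
>				  &gmem_convertible_invalidate_key);
>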
> Author: Yan Zhao <yan.y.zhao@...el.com>
> Date: Wed Jun 11 18:23:00 2025 +0800
>
> KVM: TDX: Use get_user_pages_fast_only() in tdx_gmem_post_populate()
>
> Convert get_user_pages_fast() to get_user_pages_fast_only()
> in tdx_gmem_post_populate().
>
> Unlike get_user_pages_fast(), which will acquire mm->mmap_lock and fault in
> physical pages if it finds the pages have not already been faulted in or
> have been zapped/swapped out, get_user_pages_fast_only() returns directly in
> such cases.
>
> Using get_user_pages_fast_only() avoids tdx_gmem_post_populate()
> acquiring mm->mmap_lock, which may cause an AB-BA lockdep warning with the
> shared filemap invalidate lock when guest_memfd in-place conversion is
> supported. (In path kvm_gmem_fault_shared(), the lock sequence is
> mm->mmap_lock --> filemap_invalidate_lock_shared(), while in path
> kvm_gmem_populate(), the lock sequence is filemap_invalidate_lock_shared()
> --> mm->mmap_lock.)
>
> Besides, using get_user_pages_fast_only() and returning directly to
> userspace if a page is not present in the primary PTE helps detect the
> careless case where the source pages were not initialized by userspace.
> As the initial memory region bypasses guest acceptance, copying an
> uninitialized source page to the guest could be harmful and undermine the
> page measurement.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 93c31eecfc60..462390dddf88 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3190,9 +3190,10 @@ static int tdx_gmem_post_populate_4k(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> * Get the source page if it has been faulted in. Return failure if the
> * source page has been swapped out or unmapped in primary memory.
> */
> - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page);
> + ret = get_user_pages_fast_only((unsigned long)src, 1, 0, &src_page);
> if (ret < 0)
> return ret;
> +
> if (ret != 1)
> return -ENOMEM;
>