Message-ID: <20251113230759.1562024-1-michael.roth@amd.com>
Date: Thu, 13 Nov 2025 17:07:56 -0600
From: Michael Roth <michael.roth@....com>
To: <kvm@...r.kernel.org>
CC: <linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <thomas.lendacky@....com>,
	<pbonzini@...hat.com>, <seanjc@...gle.com>, <vbabka@...e.cz>,
	<ashish.kalra@....com>, <liam.merwick@...cle.com>, <david@...hat.com>,
	<vannapurve@...gle.com>, <ackerleytng@...gle.com>, <aik@....com>,
	<ira.weiny@...el.com>, <yan.y.zhao@...el.com>
Subject: [PATCH RFC 0/3] KVM: guest_memfd: Rework preparation/population flows in prep for in-place conversion

This patchset is also available at:

  https://github.com/AMDESE/linux/tree/gmem-populate-rework-rfc1

and is based on top of kvm-x86/next (kvm-x86-next-2025.11.07)


Overview
--------

Yan previously posted a series[1] that reworked kvm_gmem_populate() to deal
with potential locking issues that might arise once in-place conversion
support[2] is added for guest_memfd. To quote Yan's original summary of the
issues:

  (1)
  In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
  changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
  invalidation lock for protecting its preparedness tracking. Similarly, the
  in-place conversion version of guest_memfd series by Ackerley also requires
  kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
  
  kvm_gmem_get_pfn
      filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
  
  However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
  in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
  filemap invalidation lock.
  
  (2)
  Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
  resulting in the following lock sequence in tdx_vcpu_init_mem_region():
  - filemap invalidation lock --> mm->mmap_lock
  
  However, in future code, the shared filemap invalidation lock will be held
  in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
  - mm->mmap_lock --> filemap invalidation lock
  
  This creates an AB-BA deadlock issue.
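
To make the ordering conflict in (2) concrete, below is a toy userspace
model of the two paths (plain pthreads, nothing KVM-specific; the
function names are purely illustrative). Each thread's lock order is
fine in isolation, but together they form a cycle that can deadlock:

  /* abba.c: toy model of the lock-ordering conflict described above.
   * populate_path() mimics tdx_vcpu_init_mem_region():
   *     filemap invalidate lock --> mm->mmap_lock
   * shared_fault_path() mimics the future kvm_gmem_fault_shared():
   *     mm->mmap_lock --> filemap invalidate lock
   */
  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t invalidate_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t mmap_lock       = PTHREAD_MUTEX_INITIALIZER;

  static void *populate_path(void *arg)
  {
          (void)arg;
          pthread_mutex_lock(&invalidate_lock);   /* A... */
          pthread_mutex_lock(&mmap_lock);         /* ...then B */
          pthread_mutex_unlock(&mmap_lock);
          pthread_mutex_unlock(&invalidate_lock);
          return NULL;
  }

  static void *shared_fault_path(void *arg)
  {
          (void)arg;
          pthread_mutex_lock(&mmap_lock);         /* B... */
          pthread_mutex_lock(&invalidate_lock);   /* ...then A: AB-BA */
          pthread_mutex_unlock(&invalidate_lock);
          pthread_mutex_unlock(&mmap_lock);
          return NULL;
  }

  int main(void)
  {
          pthread_t a, b;

          pthread_create(&a, NULL, populate_path, NULL);
          pthread_create(&b, NULL, shared_fault_path, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          printf("no deadlock this run (the orderings still conflict)\n");
          return 0;
  }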

Sean has since addressed (1) with his series[3], which avoids calling
kvm_gmem_get_pfn() within the TDX post-populate callback to re-fetch the
PFN that was already passed to it.

This series aims to address (2), which is still outstanding. It is based
heavily on Sean's suggested approach[4]: hoist the get_user_pages_fast()
call out of the TDX post-populate callback so that it runs before the
filemap invalidate lock is taken, which makes the ABBA deadlock
impossible.
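
Continuing the toy model above, the reworked populate path takes and
drops the mmap_lock (the GUP step) before the invalidate lock is ever
acquired, so mmap_lock --> invalidate lock becomes the only nesting
order anywhere and the cycle disappears. Again, purely illustrative
userspace code, not the actual patch:

  #include <pthread.h>

  static pthread_mutex_t invalidate_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t mmap_lock       = PTHREAD_MUTEX_INITIALIZER;

  /* Populate path after the rework: the "GUP" is hoisted out, so the
   * mmap_lock is taken and released before the invalidate lock. */
  static void *populate_path_fixed(void *arg)
  {
          (void)arg;
          pthread_mutex_lock(&mmap_lock);         /* pin source pages */
          pthread_mutex_unlock(&mmap_lock);

          pthread_mutex_lock(&invalidate_lock);   /* nothing nested */
          /* ...populate from the already-pinned source pages... */
          pthread_mutex_unlock(&invalidate_lock);
          return NULL;
  }

  /* Shared-fault path is unchanged from the earlier model. */
  static void *shared_fault_path(void *arg)
  {
          (void)arg;
          pthread_mutex_lock(&mmap_lock);
          pthread_mutex_lock(&invalidate_lock);
          pthread_mutex_unlock(&invalidate_lock);
          pthread_mutex_unlock(&mmap_lock);
          return NULL;
  }

  int main(void)
  {
          pthread_t a, b;

          pthread_create(&a, NULL, populate_path_fixed, NULL);
          pthread_create(&b, NULL, shared_fault_path, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          return 0;
  }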

It additionally removes 'preparation' tracking from guest_memfd, which
would similarly complicate locking considerations in the context of
in-place conversion (and even more so in the context of hugepage
support). This has been discussed during both the guest_memfd calls and
PUCK calls, and so far no strong objections have been raised, so
hopefully that particular change isn't too controversial.


Some items worth noting/discussing
----------------------------------

(A) Unlike TDX, which has always enforced that the source address used to
    populate the contents of gmem pages via kvm_gmem_populate() is
    page-aligned, SNP has explicitly allowed unaligned source addresses.
    This unfortunately means that instead of a simple 1:1 correspondence
    between source/target pages, post-populate callbacks need to be able
    to handle straddling multiple source pages to populate a single
    target page within guest_memfd, which complicates the handling (see
    the sketch after this item). While the changes to the SNP
    post-populate callback in patch #3 are not horrendous, they are
    certainly not ideal. However, architectures that never allowed a
    non-page-aligned source address can essentially ignore the
    src_pages/src_offset considerations and simply assume/enforce that
    src_offset is 0 and that src_pages[0] is the source struct page of
    relevance for each call.

    That said, it would be possible to have SNP copy unaligned source
    data into an intermediate set of bounce-buffer pages before passing
    them to some variant of kvm_gmem_populate() that skips the GUP and
    works directly with the kernel-allocated bounce pages. However,
    there is a performance hit, and potentially some additional
    interface complexity to handle the different flow, so it's not clear
    the trade-off is worth it.

    Another potential approach would be to take advantage of the fact that
    all *known* VMM implementations of SNP do use page-aligned source
    addresses, so it *may* be justifiable to retroactively enforce this as
    a requirement so that the post-populate callbacks can be simplified
    accordingly.
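
    For reference, the straddling arithmetic mentioned at the start of
    this item boils down to the following toy userspace model (not the
    actual SNP callback: pages are modeled as malloc()ed buffers, and
    src_pages/src_offset mirror the terminology above rather than the
    final interface):

      /* straddle.c: copying one target page from a non-page-aligned
       * source: the first (PAGE_SIZE - src_offset) bytes come from
       * src_pages[i], the remaining src_offset bytes from
       * src_pages[i + 1].
       */
      #include <assert.h>
      #include <stdlib.h>
      #include <string.h>

      #define PAGE_SIZE 4096

      static void copy_target_page(char *dst, char **src_pages,
                                   size_t i, size_t src_offset)
      {
              size_t first = PAGE_SIZE - src_offset;

              memcpy(dst, src_pages[i] + src_offset, first);
              if (src_offset)         /* straddles a second source page */
                      memcpy(dst + first, src_pages[i + 1], src_offset);
      }

      int main(void)
      {
              char *src_pages[2] = { malloc(PAGE_SIZE), malloc(PAGE_SIZE) };
              char dst[PAGE_SIZE];
              size_t off = 123;       /* deliberately unaligned */

              assert(src_pages[0] && src_pages[1]);
              memset(src_pages[0], 'A', PAGE_SIZE);
              memset(src_pages[1], 'B', PAGE_SIZE);
              copy_target_page(dst, src_pages, 0, off);

              assert(dst[PAGE_SIZE - off - 1] == 'A');
              assert(dst[PAGE_SIZE - off] == 'B');
              free(src_pages[0]);
              free(src_pages[1]);
              return 0;
      }

    In the aligned case (src_offset == 0) the second memcpy vanishes and
    src_pages[0] is the only page consulted, which is exactly the
    simplification available to architectures that enforce alignment.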

(B) While one of the aims of this rework is to implement things such that
    a separate source address can still be passed to kvm_gmem_populate()
    even though the gmem pages can be populated in-place from userspace
    beforehand, issues still arise if the source address itself has the
    KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set, e.g. if the source and
    target addresses are the same page. One line of reasoning would be
    to conclude that KVM_MEMORY_ATTRIBUTE_PRIVATE implies that the
    address cannot be used as the source of a GUP/copy_from_user(), and
    thus cases like source==target are naturally disallowed. Under that
    interpretation, userspace has no choice but to populate pages
    in-place *prior* to setting the KVM_MEMORY_ATTRIBUTE_PRIVATE
    attribute (as kvm_gmem_populate() requires), and to pass in NULL for
    the source so that the GUP can be skipped (otherwise, the GUP will
    trigger the shared memory fault path, which will then SIGBUS because
    it will see that it is faulting in pages for which
    KVM_MEMORY_ATTRIBUTE_PRIVATE is set).

    While workable, this would at the very least involve documentation
    updates to KVM_TDX_INIT_MEM_REGION/KVM_SEV_SNP_LAUNCH_UPDATE to cover
    these soon-to-be-possible scenarios. Ira posted a patch separately
    that demonstrates how a NULL source could be safely handled within
    the TDX post-populate callback[5].
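
    To illustrate the shape of that handling, here is a toy userspace
    model (in no way a restatement of Ira's actual patch: measure_page()
    is a made-up stand-in for the arch-specific measurement/encryption
    step):

      #include <stddef.h>
      #include <string.h>

      #define PAGE_SIZE 4096

      /* made-up stand-in for the arch-specific measure/encrypt step */
      static void measure_page(unsigned char *page)
      {
              for (size_t i = 0; i < PAGE_SIZE; i++)
                      page[i] ^= 0xAA;
      }

      /* src == NULL means userspace already populated the page in
       * place before setting KVM_MEMORY_ATTRIBUTE_PRIVATE: the copy is
       * skipped and only the measurement step remains. */
      static void post_populate(unsigned char *target,
                                const unsigned char *src)
      {
              if (src)
                      memcpy(target, src, PAGE_SIZE);
              measure_page(target);
      }

      int main(void)
      {
              static unsigned char target[PAGE_SIZE];

              memset(target, 0x11, PAGE_SIZE);  /* "populated in place" */
              post_populate(target, NULL);      /* NULL source: no copy */
              return 0;
      }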

    
Known issues / TODO
-------------------

- Compile-tested only for the TDX bits (testing/feedback welcome!)


Thanks,

Mike


[1] https://lore.kernel.org/kvm/20250703062641.3247-1-yan.y.zhao@intel.com/
[2] https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com/
[3] https://lore.kernel.org/kvm/20251030200951.3402865-1-seanjc@google.com/
[4] https://lore.kernel.org/kvm/aHEwT4X0RcfZzHlt@google.com/
[5] https://lore.kernel.org/kvm/20251105-tdx-init-in-place-v1-1-1196b67d0423@intel.com/


----------------------------------------------------------------
Michael Roth (3):
      KVM: guest_memfd: Remove preparation tracking
      KVM: TDX: Document alignment requirements for KVM_TDX_INIT_MEM_REGION
      KVM: guest_memfd: GUP source pages prior to populating guest memory

 Documentation/virt/kvm/x86/intel-tdx.rst |  2 +-
 arch/x86/kvm/svm/sev.c                   | 40 +++++++++-----
 arch/x86/kvm/vmx/tdx.c                   | 20 +++----
 include/linux/kvm_host.h                 |  3 +-
 virt/kvm/guest_memfd.c                   | 89 ++++++++++++++++++--------------
 5 files changed, 88 insertions(+), 66 deletions(-)


