linux-kernel - [PATCH RFC v1 0/5] KVM: guest_memfd: Support in-place conversion for CoCo VMs

Message-ID: <20250613005400.3694904-1-michael.roth@amd.com>
Date: Thu, 12 Jun 2025 19:53:55 -0500
From: Michael Roth <michael.roth@....com>
To: <kvm@...r.kernel.org>
CC: <linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <david@...hat.com>, <tabba@...gle.com>,
	<vannapurve@...gle.com>, <ackerleytng@...gle.com>, <ira.weiny@...el.com>,
	<thomas.lendacky@....com>, <pbonzini@...hat.com>, <seanjc@...gle.com>,
	<vbabka@...e.cz>, <joro@...tes.org>, <pratikrajesh.sampat@....com>,
	<liam.merwick@...cle.com>, <yan.y.zhao@...el.com>, <aik@....com>
Subject: [PATCH RFC v1 0/5] KVM: guest_memfd: Support in-place conversion for CoCo VMs

This patchset is also available at:

  https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1

and is based on top of the following patches plucked from Ackerley's
HugeTLBFS series[1], which add support for tracking/converting guest_memfd
pages between private/shared states so the same physical pages can be used
to handle both private/shared accesses by the guest or by userspace:

  KVM: selftests: Update script to map shared memory from guest_memfd
  KVM: selftests: Update private_mem_conversions_test to mmap guest_memfd
  KVM: selftests: Add script to exercise private_mem_conversions_test
  KVM: selftests: Test conversion flows for guest_memfd
  KVM: selftests: Allow cleanup of ucall_pool from host
  KVM: selftests: Refactor vm_mem_add to be more flexible
  KVM: selftests: Test faulting with respect to GUEST_MEMFD_FLAG_INIT_PRIVATE
  KVM: selftests: Test flag validity after guest_memfd supports conversions
  KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
  KVM: Query guest_memfd for private/shared status
  KVM: guest_memfd: Skip LRU for guest_memfd folios
  KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  KVM: guest_memfd: Introduce and use shareability to guard faulting
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  fs: Refactor to provide function that allocates a secure anonymous inode

  "[RFC PATCH v2 00/51] 1G page support for guest_memfd"
  https://lore.kernel.org/lkml/cover.1747264138.git.ackerleytng@google.com/

which is in turn based on the following series[2] from Fuad which implements
the initial support for guest_memfd to manage shared memory and allow it to
be mmap()'d into userspace:

  "[PATCH v12 00/18] KVM: Mapping guest_memfd backed memory at the host for software protected VMs"
  https://lore.kernel.org/kvm/20250611133330.1514028-1-tabba@google.com/

(One of the main goals of posting this series in its current form is to
identify the common set of dependencies to enable in-place conversion
support for SEV-SNP, TDX, and pKVM, which have been coined "stage 2"
according to upstreaming plans discussed during guest_memfd bi-weekly calls
and summarized by David here[3] (Fuad's series[2] being "stage 1"),
so please feel free to chime in here if there's any feedback on whether
something like the above set of dependencies is a reasonable starting point
for "stage 2" and how best to handle setting up a common tree to track this
dependency.)


Overview
--------

Currently guest_memfd is only used by CoCo VMs to handle private memory, and
relies on hole-punching to free memory from guest_memfd when it is converted
to shared and re-allocated from normal/non-gmem memory that's been associated
with the memslot. This has some major downsides:

  1) for future use-cases like 1GB HugeTLB support in gmem, the ability to
     hole-punch pages after conversion is almost completely lost since
     truncation at sub-1GB granularities won't free the page, and truncation
     at 1GB or greater granularity will likely require userspace to track
     free ranges and defer truncation until the entire range has been
     converted, which will often never happen for a particular 1GB range.

  2) for things like PCI passthrough, where normal/non-gmem memory is
     pinned, this quickly leads to doubled guest memory usage once the guest
     has converted most of its pages to private, but the previously
     allocated pages can't be hole-punched until they are unmapped from the
     IOMMU. While there are reasonable solutions for this, like the
     RamDiscardManager proposed[4] for QEMU, in-place conversion handles
     this memory doubling problem
     essentially for free, and makes it easier to mix PCI passthrough of
     normal devices together with PCI passthrough of trusted devices (e.g.
     for SEV-TIO) where it's actually *private* memory that needs to be
     mapped into the IOMMU, and thus there's less clarity about what pages
     can/can't be freed/unmapped from IOMMU when pages are converted between
     shared/private.

  3) interfaces like mbind() which rely on virtual addresses to set NUMA
     affinities are not available for unmappable guest_memfd pages, requiring
     additional management interfaces to handle guest_memfd separately from
     normal memory.

  4) pages cannot be populated directly from userspace because guest_memfd
     is unmappable, requiring the use of intermediate buffers which the
     kernel then copies into the corresponding guest_memfd page.
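
As a rough illustration of downside 2), the following toy accounting model
compares the host memory footprint when private and shared pages live in
separate backings against the footprint when conversion happens in place.
It is plain userspace C, not kernel code, and every name in it is
hypothetical. It assumes that pinned normal memory stays fully allocated no
matter how many pages the guest has converted to private:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy accounting model only; all names here are hypothetical. */
struct toy_guest {
	size_t nr_pages;	/* guest size in pages */
	size_t nr_private;	/* pages currently converted to private */
	bool shared_pinned;	/* normal memory pinned, e.g. for PCI passthrough */
};

/* Separate backings: private pages in gmem, shared pages in normal memory. */
static size_t toy_footprint_separate(const struct toy_guest *g)
{
	size_t gmem, normal;

	assert(g->nr_private <= g->nr_pages);
	gmem = g->nr_private;
	/* Assumption: pinned pages cannot be hole-punched after conversion. */
	normal = g->shared_pinned ? g->nr_pages
				  : g->nr_pages - g->nr_private;
	return gmem + normal;
}

/* In-place conversion: one backing page per guest page in either state. */
static size_t toy_footprint_inplace(const struct toy_guest *g)
{
	assert(g->nr_private <= g->nr_pages);
	return g->nr_pages;
}
```

For a 1000-page guest with 900 private pages and pinned shared memory, the
separate-backing model reports 1900 pages while the in-place model stays at
1000.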

Supporting in-place conversion, and allowing shared pages to be mmap()'d
and accessed by userspace similarly to normal/non-CoCo guests, addresses
most of these issues fairly naturally.

With the above-mentioned dependencies in place, only a fairly small set of
additional changes are needed to allow SEV-SNP and (hopefully) other CoCo
platforms to use guest_memfd in this manner, and that "small set" of
additional changes is what this series is meant to call out to consider for
potential inclusion into the common "stage 2" tree so that pKVM/TDX in-place
conversion can be similarly enabled with minimal additional changes needed
on top and so we can start looking at getting the related userspace APIs
finalized.


Some topics for discussion
--------------------------

1) Removal of preparation tracking from guest_memfd
   
   This is the most significant change in this series, since I know in
   the past there was a strong desire to have guest_memfd be aware of
   what has/hasn't been prepared rather than off-loading the knowledge
   to platform-specific code. While it was initially planned to maintain
   this preparedness-tracking in guest_memfd, there are some complexities
   it brings along in the context of in-place conversion and hugetlb
   enablement that I think make it worthwhile to revisit.
   
   A) it has unique locking requirements[5], since "preparation" needs to
      happen lazily to gain any benefit from lazy-acceptance/lazy-faulting
      of guest memory, and that generally ends up being at fault-time, but
      data structures to track "preparation" require locks to update the
      state, and reduce guest_memfd's ability to handle concurrent faults
      from multiple vCPUs efficiently. While there are proposed locking
      schemes that could potentially handle this reasonably[5], getting rid
      of this tracking in guest_memfd allows for things like shared/private
      state to be tracked via much simpler schemes like rw_semaphores (or
      just re-using the filemap invalidate lock as is done here).

   B) only SEV-SNP is actually making any meaningful use of it. Platforms
      like TDX handle preparation and preparation-tracking outside of
      guest_memfd, so operating under the general assumption that guest_memfd
      has a clear notion of what is/isn't prepared could bite us in some
      cases versus just punting to platform-specific tracking.


2) Proper point to begin generally advertising KVM_CAP_GMEM_CONVERSION?

   Currently the various dependencies these patches are based on top of
   advertise support for converting guest_memfd pages between shared/private
   via KVM_CAP_GMEM_CONVERSION. However, for SEV-SNP at least, these
   additional patches are needed. So perhaps the initial enablement for
   KVM_CAP_GMEM_CONVERSION should only be done for non-CoCo VMs to enable
   the self-tests so that userspace can reliably probe for support for a
   specific VM type?
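
   For reference, probing from userspace would presumably go through the
   existing KVM_CHECK_EXTENSION ioctl, which can be issued on a VM fd to
   get a per-VM answer. The helper below is a minimal sketch: its name is
   hypothetical, and the capability number is taken as a parameter because
   the value of KVM_CAP_GMEM_CONVERSION comes from the dependency series
   and is not assumed here:

```c
#include <errno.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Query a capability on a KVM fd. 'fd' may be the /dev/kvm fd or, for a
 * per-VM answer, a VM fd returned by KVM_CREATE_VM. 'cap' is supplied by
 * the caller (e.g. KVM_CAP_GMEM_CONVERSION from the patched headers).
 *
 * Returns >0 if supported, 0 if not, -errno on failure.
 */
static int toy_kvm_check_cap(int fd, long cap)
{
	int ret = ioctl(fd, KVM_CHECK_EXTENSION, cap);

	return ret < 0 ? -errno : ret;
}
```

   A VMM would create a VM of the type it cares about and call this on the
   resulting VM fd, so a VM type that lacks the additional enablement could
   report 0 even when other VM types report support.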


Testing
-------

This series has only been tested with SEV-SNP guests using the following
modified QEMU branch:

  https://github.com/amdese/qemu/commits/snp-mmap-gmem0-wip4

and beyond that only via the kselftests added by Ackerley that exercise the
gmem conversion support/ioctls this series is based on.


TODO
----

 - Rebase on (or merge into?) proper "stage 2" once we work out what that is.
 - Confirm no breakages to Fuad's "stage 1" kselftests.
 - Add kselftest coverage for SNP guests using shareable gmem.


References
----------

[1] "[RFC PATCH v2 00/51] 1G page support for guest_memfd",
    https://lore.kernel.org/lkml/cover.1747264138.git.ackerleytng@google.com/
[2] "[PATCH v12 00/18] KVM: Mapping guest_memfd backed memory at the host for software protected VMs",
    https://lore.kernel.org/kvm/20250611133330.1514028-1-tabba@google.com/
[3] "[Overview] guest_memfd extensions and dependencies 2025-05-15",
    https://lore.kernel.org/kvm/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com/
[4] "[PATCH v7 0/5] Enable shared device assignment",
    https://lore.kernel.org/kvm/20250612082747.51539-1-chenyi.qiang@intel.com/
[5] https://lore.kernel.org/kvm/20250529054227.hh2f4jmyqf6igd3i@amd.com/


Thanks!

-Mike


----------------------------------------------------------------
Michael Roth (5):
      KVM: guest_memfd: Remove preparation tracking
      KVM: guest_memfd: Only access KVM memory attributes when appropriate
      KVM: guest_memfd: Call arch invalidation hooks when converting to shared
      KVM: guest_memfd: Don't prepare shared folios
      KVM: SEV: Make SNP_LAUNCH_UPDATE ignore 'uaddr' if guest_memfd is shareable

 .../virt/kvm/x86/amd-memory-encryption.rst         |  4 +-
 arch/x86/kvm/svm/sev.c                             | 14 +++-
 virt/kvm/guest_memfd.c                             | 92 +++++++++++++---------
 3 files changed, 68 insertions(+), 42 deletions(-)