Message-ID: <20250613005400.3694904-1-michael.roth@amd.com>
Date: Thu, 12 Jun 2025 19:53:55 -0500
From: Michael Roth <michael.roth@....com>
To: <kvm@...r.kernel.org>
CC: <linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, <david@...hat.com>, <tabba@...gle.com>,
<vannapurve@...gle.com>, <ackerleytng@...gle.com>, <ira.weiny@...el.com>,
<thomas.lendacky@....com>, <pbonzini@...hat.com>, <seanjc@...gle.com>,
<vbabka@...e.cz>, <joro@...tes.org>, <pratikrajesh.sampat@....com>,
<liam.merwick@...cle.com>, <yan.y.zhao@...el.com>, <aik@....com>
Subject: [PATCH RFC v1 0/5] KVM: guest_memfd: Support in-place conversion for CoCo VMs
This patchset is also available at:
https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1
and is based on top of the following patches plucked from Ackerley's
HugeTLBFS series[1], which add support for tracking/converting guest_memfd
pages between private/shared states so the same physical pages can be used
to handle both private/shared accesses by the guest or by userspace:
KVM: selftests: Update script to map shared memory from guest_memfd
KVM: selftests: Update private_mem_conversions_test to mmap guest_memfd
KVM: selftests: Add script to exercise private_mem_conversions_test
KVM: selftests: Test conversion flows for guest_memfd
KVM: selftests: Allow cleanup of ucall_pool from host
KVM: selftests: Refactor vm_mem_add to be more flexible
KVM: selftests: Test faulting with respect to GUEST_MEMFD_FLAG_INIT_PRIVATE
KVM: selftests: Test flag validity after guest_memfd supports conversions
KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
KVM: Query guest_memfd for private/shared status
KVM: guest_memfd: Skip LRU for guest_memfd folios
KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
KVM: guest_memfd: Introduce and use shareability to guard faulting
KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
fs: Refactor to provide function that allocates a secure anonymous inode
"[RFC PATCH v2 00/51] 1G page support for guest_memfd"
https://lore.kernel.org/lkml/cover.1747264138.git.ackerleytng@google.com/
which is in turn based on the following series[2] from Fuad which implements
the initial support for guest_memfd to manage shared memory and allow it to
be mmap()'d into userspace:
"[PATCH v12 00/18] KVM: Mapping guest_memfd backed memory at the host for software protected VMs"
https://lore.kernel.org/kvm/20250611133330.1514028-1-tabba@google.com/
(One of the main goals of posting this series in its current form is to
identify the common set of dependencies to enable in-place conversion
support for SEV-SNP, TDX, and pKVM, which have been coined "stage 2"
according to upstreaming plans discussed during guest_memfd bi-weekly calls
and summarized by David here[3] (Fuad's series[2] being "stage 1"),
so please feel free to chime in here if there's any feedback on whether
something like the above set of dependencies is a reasonable starting point
for "stage 2" and how best to handle setting up a common tree to track this
dependency.)
Overview
--------
Currently guest_memfd is only used by CoCo VMs to handle private memory, and
relies on hole-punching to free memory from guest_memfd when it is converted
to shared and re-allocated from normal/non-gmem memory that's been associated
with the memslot. This has some major downsides:
1) for future use-cases like 1GB HugeTLB support in gmem, the ability to
hole-punch pages after conversion is almost completely lost since
truncation at sub-1GB granularities won't free the page, and truncation
at 1GB or greater granularity will likely require userspace to track free ranges
and defer truncation until the entire range has been converted, which
will often never happen for a particular 1GB range.
2) for things like PCI passthrough, where normal/non-gmem memory is
pinned, this quickly leads to doubled guest memory usage once the guest
has converted most of its pages to private, but the previously allocated
pages can't be hole-punched until being unmapped from IOMMU. While there
are reasonable solutions for this like the RamDiscardManager proposed[4]
for QEMU, in-place conversion handles this memory doubling problem
essentially for free, and makes it easier to mix PCI passthrough of
normal devices together with PCI passthrough of trusted devices (e.g.
for SEV-TIO) where it's actually *private* memory that needs to be
mapped into the IOMMU, and thus there's less clarity about what pages
can/can't be freed/unmapped from IOMMU when pages are converted between
shared/private.
3) interfaces like mbind() which rely on virtual addresses to set NUMA
affinities are not available for unmappable guest_memfd pages, requiring
additional management interfaces to handle guest_memfd separately from
normal memory.
4) not being able to populate pages directly from userspace due to
guest_memfd being unmappable, requiring the use of intermediate buffers
which the kernel then copies into corresponding guest_memfd page.
Supporting in-place conversion, and allowing shared pages to be mmap()'d and
accessed by userspace similarly to normal/non-CoCo guests, addresses most of
these issues fairly naturally.
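To make downside 1) above more concrete, here is a rough userspace sketch of
the kind of bookkeeping a VMM might need under the current hole-punching
model if guest_memfd were backed by 1GB pages: remember which 4K pages in a
1GB range have been converted to shared, and only truncate once the whole
range has been. This is purely illustrative and not taken from this series
or from QEMU; the struct, helpers, and constants (gb_range_tracker,
tracker_mark_shared(), etc.) are all hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE_4K 4096ULL
#define HUGE_SIZE_1G (1ULL << 30)
#define PAGES_PER_1G (HUGE_SIZE_1G / PAGE_SIZE_4K) /* 262144 pages */

/* Hypothetical per-1GB-range state a VMM would have to maintain. */
struct gb_range_tracker {
	uint64_t shared_bitmap[PAGES_PER_1G / 64]; /* 1 bit per 4K page */
	uint64_t nr_shared;                        /* number of bits set */
};

static void tracker_init(struct gb_range_tracker *t)
{
	memset(t, 0, sizeof(*t));
}

/*
 * Record that [offset, offset + len) within the 1GB range was converted
 * to shared. Returns true only once every 4K page in the range has been
 * converted, i.e. the point where truncating the 1GB page would actually
 * free memory.
 */
static bool tracker_mark_shared(struct gb_range_tracker *t,
				uint64_t offset, uint64_t len)
{
	uint64_t first, last, i;

	if (len == 0 || offset >= HUGE_SIZE_1G)
		return t->nr_shared == PAGES_PER_1G;

	first = offset / PAGE_SIZE_4K;
	last = (offset + len - 1) / PAGE_SIZE_4K;
	if (last >= PAGES_PER_1G)
		last = PAGES_PER_1G - 1;

	for (i = first; i <= last; i++) {
		uint64_t mask = 1ULL << (i % 64);

		/* Count each page once, even if it is reported twice. */
		if (!(t->shared_bitmap[i / 64] & mask)) {
			t->shared_bitmap[i / 64] |= mask;
			t->nr_shared++;
		}
	}

	return t->nr_shared == PAGES_PER_1G;
}
```

A single 4K page left private in the range keeps tracker_mark_shared()
returning false indefinitely, which is the "will often never happen" case
described above. With in-place conversion this bookkeeping is not needed for
freeing memory, since the same physical page simply changes state.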
With the above-mentioned dependencies in place, only a fairly small set of
additional changes are needed to allow SEV-SNP and (hopefully) other CoCo
platforms to use guest_memfd in this manner, and that "small set" of
additional changes is what this series is meant to call out to consider for
potential inclusion into the common "stage 2" tree so that pKVM/TDX in-place
conversion can be similarly enabled with minimal additional changes needed
on top and so we can start looking at getting the related userspace APIs
finalized.
Some topics for discussion
--------------------------
1) Removal of preparation tracking from guest_memfd
This is the most significant change in this series, since I know in
the past there was a strong desire to have guest_memfd be aware of
what has/hasn't been prepared rather than off-loading the knowledge
to platform-specific code. While it was initially planned to maintain
this preparedness-tracking in guest_memfd, there are some complexities
it brings along in the context of in-place conversion and hugetlb
enablement that I think make it worthwhile to revisit.
A) it has unique locking requirements[5], since "preparation" needs to
happen lazily to gain any benefit from lazy-acceptance/lazy-faulting
of guest memory, and that generally ends up being at fault-time, but
data structures to track "preparation" require locks to update the
state, and reduce guest_memfd's ability to handle concurrent faults
from multiple vCPUs efficiently. While there are proposed locking
schemes that could potentially handle this reasonably[5], getting rid
of this tracking in guest_memfd allows for things like shared/private
state to be tracked via much simpler schemes like rw_semaphores (or
just re-using the filemap invalidate lock as is done here).
B) only SEV-SNP is actually making any meaningful use of it. Platforms
like TDX handle preparation and preparation-tracking outside of
guest_memfd, so operating under the general assumption that guest_memfd
has a clear notion of what is/isn't prepared could bite us in some
cases versus just punting to platform-specific tracking.
2) Proper point to begin generally advertising KVM_CAP_GMEM_CONVERSION?
Currently the various dependencies these patches are based on top of
advertise support for converting guest_memfd pages between shared/private
via KVM_CAP_GMEM_CONVERSION. However, for SEV-SNP at least, these
additional patches are needed. So perhaps the initial enablement for
KVM_CAP_GMEM_CONVERSION should only be done for non-CoCo VMs to enable
the self-tests so that userspace can reliably probe for support for a
specific VM type?
Testing
-------
This series has only been tested with SEV-SNP guests using the following
modified QEMU branch:
https://github.com/amdese/qemu/commits/snp-mmap-gmem0-wip4
and beyond that only via the kselftests added by Ackerley that exercise the
gmem conversion support/ioctls this series is based on.
TODO
----
- Rebase on (or merge into?) proper "stage 2" once we work out what that is.
- Confirm no breakages to Fuad's "stage 1" kselftests.
- Add kselftest coverage for SNP guests using shareable gmem.
References
----------
[1] "[RFC PATCH v2 00/51] 1G page support for guest_memfd",
https://lore.kernel.org/lkml/cover.1747264138.git.ackerleytng@google.com/
[2] "[PATCH v12 00/18] KVM: Mapping guest_memfd backed memory at the host for software protected VMs",
https://lore.kernel.org/kvm/20250611133330.1514028-1-tabba@google.com/
[3] "[Overview] guest_memfd extensions and dependencies 2025-05-15",
https://lore.kernel.org/kvm/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com/
[4] "[PATCH v7 0/5] Enable shared device assignment",
https://lore.kernel.org/kvm/20250612082747.51539-1-chenyi.qiang@intel.com/
[5] https://lore.kernel.org/kvm/20250529054227.hh2f4jmyqf6igd3i@amd.com/
Thanks!
-Mike
----------------------------------------------------------------
Michael Roth (5):
KVM: guest_memfd: Remove preparation tracking
KVM: guest_memfd: Only access KVM memory attributes when appropriate
KVM: guest_memfd: Call arch invalidation hooks when converting to shared
KVM: guest_memfd: Don't prepare shared folios
KVM: SEV: Make SNP_LAUNCH_UPDATE ignore 'uaddr' if guest_memfd is shareable
.../virt/kvm/x86/amd-memory-encryption.rst | 4 +-
arch/x86/kvm/svm/sev.c | 14 +++-
virt/kvm/guest_memfd.c | 92 +++++++++++++---------
3 files changed, 68 insertions(+), 42 deletions(-)