Message-ID: <20251113230759.1562024-1-michael.roth@amd.com>
Date: Thu, 13 Nov 2025 17:07:56 -0600
From: Michael Roth <michael.roth@....com>
To: <kvm@...r.kernel.org>
CC: <linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, <thomas.lendacky@....com>,
<pbonzini@...hat.com>, <seanjc@...gle.com>, <vbabka@...e.cz>,
<ashish.kalra@....com>, <liam.merwick@...cle.com>, <david@...hat.com>,
<vannapurve@...gle.com>, <ackerleytng@...gle.com>, <aik@....com>,
<ira.weiny@...el.com>, <yan.y.zhao@...el.com>
Subject: [PATCH RFC 0/3] KVM: guest_memfd: Rework preparation/population flows in prep for in-place conversion
This patchset is also available at:
https://github.com/AMDESE/linux/tree/gmem-populate-rework-rfc1
and is based on top of kvm-x86/next (kvm-x86-next-2025.11.07)
Overview
--------
Yan previously posted a series[1] that reworked kvm_gmem_populate() to deal
with potential locking issues that might arise once in-place conversion
support[2] is added for guest_memfd. To quote Yan's original summary of the
issues:
(1)
In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
invalidation lock for protecting its preparedness tracking. Similarly, the
in-place conversion guest_memfd series by Ackerley also requires
kvm_gmem_get_pfn() to acquire the filemap invalidation lock [5].
  kvm_gmem_get_pfn
    filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
filemap invalidation lock.
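Spelled out as a call chain (function names as in the current upstream
code; the nesting is approximate):

  tdx_vcpu_init_mem_region()
    kvm_gmem_populate()
      filemap_invalidate_lock()                    <-- taken exclusive
      post_populate() == tdx_gmem_post_populate()
        kvm_tdp_map_page()
          kvm_gmem_get_pfn()
            filemap_invalidate_lock_shared()       <-- nested: deadlock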
(2)
Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
resulting in the following lock sequence in tdx_vcpu_init_mem_region():
- filemap invalidation lock --> mm->mmap_lock
However, in future code, the shared filemap invalidation lock will be held
in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
- mm->mmap_lock --> filemap invalidation lock
This creates an AB-BA deadlock issue.
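Expressed as call chains, the two conflicting lock orderings would be
roughly:

  tdx_vcpu_init_mem_region()
    kvm_gmem_populate()
      filemap_invalidate_lock()              <-- A
      tdx_gmem_post_populate()
        get_user_pages_fast()
          mmap_lock (slow-path fallback)     <-- B

  kvm_gmem_fault_shared()                    (future, per [6] above)
    mmap_lock held by the fault path         <-- B
    filemap_invalidate_lock_shared()         <-- A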
Sean has since addressed (1) with his series[3], which avoids calling
kvm_gmem_get_pfn() within the TDX post-populate callback to re-fetch the
PFN that was already passed to it.
This series aims to address (2), which is still outstanding. It is based
heavily on Sean's suggested approach[4] of hoisting the
get_user_pages_fast() call out of the TDX post-populate callback so that
it can be done prior to taking the filemap invalidation lock, at which
point the ABBA deadlock is no longer possible.
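As a rough illustration of the resulting shape (a hypothetical sketch
only, not the actual patch: kvm_gmem_mapping() and
kvm_gmem_populate_range() are made-up placeholders for the real
slot/file lookup and per-page population loop), the GUP is hoisted
ahead of the lock:

  long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
                         long npages, kvm_gmem_populate_cb post_populate,
                         void *opaque)
  {
          /* Hypothetical helper standing in for the slot/file lookup. */
          struct address_space *mapping = kvm_gmem_mapping(kvm, start_gfn);
          struct page **src_pages = NULL;
          long pinned = 0, ret;

          if (src) {
                  src_pages = kvmalloc_array(npages, sizeof(*src_pages),
                                             GFP_KERNEL);
                  if (!src_pages)
                          return -ENOMEM;

                  /*
                   * May take mm->mmap_lock; the invalidation lock is not
                   * yet held, so the ABBA inversion in (2) cannot occur.
                   * (An unaligned src may need one extra page pinned;
                   * elided here, see (A) below.)
                   */
                  pinned = get_user_pages_fast((unsigned long)src, npages,
                                               0, src_pages);
                  if (pinned < 0) {
                          ret = pinned;
                          goto out_free;
                  }
                  if (pinned < npages) {
                          ret = -EFAULT;
                          goto out_put;
                  }
          }

          filemap_invalidate_lock(mapping);
          /*
           * Hypothetical per-page loop: allocate/look up the gmem folios
           * and hand the already-pinned source pages to post_populate().
           * No GUP happens with the invalidation lock held.
           */
          ret = kvm_gmem_populate_range(kvm, start_gfn, src_pages,
                                        offset_in_page(src), npages,
                                        post_populate, opaque);
          filemap_invalidate_unlock(mapping);

  out_put:
          release_pages(src_pages, pinned);
  out_free:
          kvfree(src_pages);
          return ret;
  }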
It additionally removes 'preparation' tracking from guest_memfd, which
would similarly complicate locking considerations in the context of
in-place conversion (and even more so in the context of hugepage
support). This has been discussed during both the guest_memfd calls and
PUCK calls, and so far no strong objections have been raised, so
hopefully that particular change isn't too controversial.
Some items worth noting/discussing
----------------------------------
(A) Unlike TDX, which has always enforced that the source address used to
populate the contents of gmem pages via kvm_gmem_populate() is
page-aligned, SNP has explicitly allowed unaligned source addresses.
This unfortunately means that instead of a simple 1:1 correspondence
between source/target pages, post-populate callbacks need to be able
to handle straddling multiple source pages to populate a single target
page within guest_memfd, which complicates the handling. While the
changes to the SNP post-populate callback in patch #3 are not
horrendous, they certainly are not ideal.
However, architectures that never allowed a non-page-aligned source
address can essentially ignore the src_pages/src_offset considerations
and simply assume/enforce that src_offset is 0 and that src_pages[0] is
the relevant source struct page for each call; see the sketch below.
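To illustrate what that straddling handling looks like, here is a
hypothetical sketch of the copy a post-populate callback would need to
perform for one target page when src_offset may be non-zero (the
PAGE_SIZE bytes can span two pinned source pages):

  static void copy_from_src_pages(void *dst, struct page **src_pages,
                                  unsigned int src_offset)
  {
          size_t first = PAGE_SIZE - src_offset;
          void *src;

          /* First chunk: tail of the first source page. */
          src = kmap_local_page(src_pages[0]);
          memcpy(dst, src + src_offset, first);
          kunmap_local(src);

          /* Remainder (only if unaligned): head of the next source page. */
          if (src_offset) {
                  src = kmap_local_page(src_pages[1]);
                  memcpy(dst + first, src, src_offset);
                  kunmap_local(src);
          }
  }

With src_offset enforced to 0, the second memcpy() disappears and
src_pages[0] is all that matters.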
That said, it would be possible to have SNP copy unaligned pages into
an intermediate set of bounce-buffer pages before passing them to some
variant of kvm_gmem_populate() that skips the GUP and just works directly
with the kernel-allocated bounce pages, but there is a performance hit
there, and potentially some additional complexity with the interfaces to
handle the different flow, so it's not clear if the trade-off is worth
it.
Another potential approach would be to take advantage of the fact that
all *known* VMM implementations of SNP do use page-aligned source
addresses, so it *may* be justifiable to retroactively enforce this as
a requirement so that the post-populate callbacks can be simplified
accordingly.
(B) While one of the aims of this rework is to implement things such that
a separate source address can still be passed to kvm_gmem_populate()
even though the gmem pages can be populated in-place from userspace
beforehand, issues still arise if the source address itself has the
KVM_MEMORY_ATTRIBUTE_PRIVATE attribute set, e.g. if source/target
addresses are the same page. One line of reasoning would be to
conclude that KVM_MEMORY_ATTRIBUTE_PRIVATE implies that the memory
cannot be used as the source of a GUP/copy_from_user(), and thus cases
like source==target are naturally disallowed. Under that
interpretation, userspace has no choice but to populate pages in-place
*prior* to setting the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute (as
kvm_gmem_populate() requires), and to pass in NULL for the source so
that the GUP can be skipped (otherwise the GUP will trigger the shared
memory fault path, which will then SIGBUS because it will see that it
is faulting in pages for which KVM_MEMORY_ATTRIBUTE_PRIVATE is set).
While workable, this would at the very least involve documentation
updates to KVM_TDX_INIT_MEM_REGION/KVM_SEV_SNP_LAUNCH_UPDATE to cover
these soon-to-be-possible scenarios. Ira posted a patch separately
that demonstrates how a NULL source could be safely handled within
the TDX post-populate callback[5].
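As a rough idea of the shape of such handling (a hypothetical sketch
only; the callback signature here is made up, and Ira's actual
patch[5] differs in its details):

  static int post_populate_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                              struct page **src_pages,
                              unsigned int src_offset, void *opaque)
  {
          void *dst = kmap_local_page(pfn_to_page(pfn));

          /*
           * A NULL source means the gmem page was already populated
           * in-place by userspace, so there is nothing to copy.
           */
          if (src_pages)
                  copy_from_src_pages(dst, src_pages, src_offset);

          kunmap_local(dst);

          /* ...followed by arch-specific measurement/encryption. */
          return 0;
  }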
Known issues / TODO
-------------------
- Compile-tested only for the TDX bits (testing/feedback welcome!)
Thanks,
Mike
[1] https://lore.kernel.org/kvm/20250703062641.3247-1-yan.y.zhao@intel.com/
[2] https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com/
[3] https://lore.kernel.org/kvm/20251030200951.3402865-1-seanjc@google.com/
[4] https://lore.kernel.org/kvm/aHEwT4X0RcfZzHlt@google.com/
[5] https://lore.kernel.org/kvm/20251105-tdx-init-in-place-v1-1-1196b67d0423@intel.com/
----------------------------------------------------------------
Michael Roth (3):
KVM: guest_memfd: Remove preparation tracking
KVM: TDX: Document alignment requirements for KVM_TDX_INIT_MEM_REGION
KVM: guest_memfd: GUP source pages prior to populating guest memory
Documentation/virt/kvm/x86/intel-tdx.rst | 2 +-
arch/x86/kvm/svm/sev.c | 40 +++++++++-----
arch/x86/kvm/vmx/tdx.c | 20 +++----
include/linux/kvm_host.h | 3 +-
virt/kvm/guest_memfd.c | 89 ++++++++++++++++++--------------
5 files changed, 88 insertions(+), 66 deletions(-)