Message-ID: <691f6d3d93231_502d810036@iweiny-mobl.notmuch>
Date: Thu, 20 Nov 2025 13:34:21 -0600
From: Ira Weiny <ira.weiny@...el.com>
To: Michael Roth <michael.roth@....com>, <kvm@...r.kernel.org>
CC: <linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, <thomas.lendacky@....com>,
<pbonzini@...hat.com>, <seanjc@...gle.com>, <vbabka@...e.cz>,
<ashish.kalra@....com>, <liam.merwick@...cle.com>, <david@...hat.com>,
<vannapurve@...gle.com>, <ackerleytng@...gle.com>, <aik@....com>,
<ira.weiny@...el.com>, <yan.y.zhao@...el.com>
Subject: Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to
populating guest memory
Michael Roth wrote:
> Currently the post-populate callbacks handle copying source pages into
> private GPA ranges backed by guest_memfd, where kvm_gmem_populate()
> acquires the filemap invalidate lock, then calls a post-populate
> callback which may issue a get_user_pages() on the source pages prior to
> copying them into the private GPA (e.g. TDX).
>
> This will not be compatible with in-place conversion, where the
> userspace page fault path will attempt to acquire filemap invalidate
> lock while holding the mm->mmap_lock, leading to a potential ABBA
> deadlock[1].
>
> Address this by hoisting the GUP above the filemap invalidate lock so
> that these page faults path can be taken early, prior to acquiring the
> filemap invalidate lock.
>
> It's not currently clear whether this issue is reachable with the
> current implementation of guest_memfd, which doesn't support in-place
> conversion, however it does provide a consistent mechanism to provide
> stable source/target PFNs to callbacks rather than punting to
> vendor-specific code, which allows for more commonality across
> architectures, which may be worthwhile even without in-place conversion.
After thinking on the alignment issue:
In the direction we are going (in-place conversion), I'm struggling to
see why keeping the complexity of allowing a misaligned src pointer for
the data is valuable (which, BTW, seems to require a size aligned to a
multiple of PAGE_SIZE to avoid leaking data?).
Once in-place conversion is complete the entire page needs to be converted
to private, so it seems requiring that alignment just makes things cleaner
without really restricting any known use cases.
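To illustrate what the misaligned case costs, here is a quick userspace
sketch of the split-copy arithmetic the patch implements (names like
copy_misaligned()/demo() are made up for the sketch; 4K pages assumed).
It reassembles each destination page from the tail of source page i plus
the head of source page i+1:

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/*
 * Mimic the patch's copy loop: the source data starts src_offset bytes
 * into the first of npages + 1 pinned pages, so each destination page
 * is built from two memcpy()s when src_offset != 0.
 */
static void copy_misaligned(unsigned char *dst, unsigned char **src_pages,
			    size_t src_offset, size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		unsigned char *d = dst + i * PAGE_SIZE;

		memcpy(d, src_pages[i] + src_offset, PAGE_SIZE - src_offset);
		if (src_offset)
			/* tail of dst page = first src_offset bytes of next src page */
			memcpy(d + PAGE_SIZE - src_offset, src_pages[i + 1],
			       src_offset);
	}
}

/* Round-trip check: returns 0 if the copy reassembles the original data. */
static int demo(void)
{
	enum { NPAGES = 2, OFF = 100 };
	static unsigned char data[NPAGES * PAGE_SIZE];
	static unsigned char backing[(NPAGES + 1) * PAGE_SIZE];
	static unsigned char dst[NPAGES * PAGE_SIZE];
	unsigned char *src_pages[NPAGES + 1];

	for (size_t i = 0; i < sizeof(data); i++)
		data[i] = (unsigned char)rand();

	/* Lay the data into the backing pages starting at offset OFF. */
	memcpy(backing + OFF, data, sizeof(data));
	for (size_t i = 0; i <= NPAGES; i++)
		src_pages[i] = backing + i * PAGE_SIZE;

	copy_misaligned(dst, src_pages, OFF, NPAGES);
	return memcmp(dst, data, sizeof(data));
}
```

With src_offset == 0 the second memcpy() and the extra pinned page both
vanish, which is exactly the simplification a page-aligned src buys.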
General comments below.
[snip]
> @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf
> goto err;
> }
>
> - if (src) {
> - void *vaddr = kmap_local_pfn(pfn + i);
> + if (src_pages) {
> + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i]));
> + void *dst_vaddr = kmap_local_pfn(pfn + i);
>
> - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) {
> - ret = -EFAULT;
> - goto err;
> + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset);
> + kunmap_local(src_vaddr);
> +
> + if (src_offset) {
> + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1]));
> +
> + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset);
^^^^^^^^^^
PAGE_SIZE - src_offset?
> + kunmap_local(src_vaddr);
> }
> - kunmap_local(vaddr);
> +
> + kunmap_local(dst_vaddr);
> }
>
> ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K,
> @@ -2331,12 +2339,20 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf
> if (!snp_page_reclaim(kvm, pfn + i) &&
> sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
> sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
> - void *vaddr = kmap_local_pfn(pfn + i);
> + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i]));
> + void *dst_vaddr = kmap_local_pfn(pfn + i);
>
> - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE))
> - pr_debug("Failed to write CPUID page back to userspace\n");
> + memcpy(src_vaddr + src_offset, dst_vaddr, PAGE_SIZE - src_offset);
> + kunmap_local(src_vaddr);
> +
> + if (src_offset) {
> + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1]));
> +
> + memcpy(src_vaddr, dst_vaddr + PAGE_SIZE - src_offset, src_offset);
^^^^^^^^^^
PAGE_SIZE - src_offset?
> + kunmap_local(src_vaddr);
> + }
>
> - kunmap_local(vaddr);
> + kunmap_local(dst_vaddr);
> }
>
> /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 57ed101a1181..dd5439ec1473 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3115,37 +3115,26 @@ struct tdx_gmem_post_populate_arg {
> };
>
> static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> - void __user *src, int order, void *_arg)
> + struct page **src_pages, loff_t src_offset,
> + int order, void *_arg)
> {
> struct tdx_gmem_post_populate_arg *arg = _arg;
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> u64 err, entry, level_state;
> gpa_t gpa = gfn_to_gpa(gfn);
> - struct page *src_page;
> int ret, i;
>
> if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> return -EIO;
>
> - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm))
> + /* Source should be page-aligned, in which case src_offset will be 0. */
> + if (KVM_BUG_ON(src_offset))
This failed to compile; the macro needs the kvm parameter
(KVM_BUG_ON(src_offset, kvm)).
> return -EINVAL;
>
> - /*
> - * Get the source page if it has been faulted in. Return failure if the
> - * source page has been swapped out or unmapped in primary memory.
> - */
> - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page);
> - if (ret < 0)
> - return ret;
> - if (ret != 1)
> - return -ENOMEM;
> -
> - kvm_tdx->page_add_src = src_page;
> + kvm_tdx->page_add_src = src_pages[i];
i is uninitialized here. src_pages[0] should be fine.
All the src_offset handling in the rest of the patch would just go away
if we made that restriction for SNP, and you mentioned there was no real
use case for it. Also, technically I think TDX _could_ do the same thing
the SNP populate path is doing: the kernel could map the buffer, do the
offset copy into the destination page, and then do the in-place
encryption. But I've not tried it, because I really think this is more
effort than the kernel needs to make. If a use case becomes necessary,
it may be better to put that in the gmem core once TDX is tested.
Thoughts?
Ira
[snip]