Message-ID: <68717342cfafc_37c14b294a6@iweiny-mobl.notmuch>
Date: Fri, 11 Jul 2025 15:25:38 -0500
From: Ira Weiny <ira.weiny@...el.com>
To: Michael Roth <michael.roth@....com>, Sean Christopherson
<seanjc@...gle.com>
CC: Yan Zhao <yan.y.zhao@...el.com>, <pbonzini@...hat.com>,
<kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<rick.p.edgecombe@...el.com>, <kai.huang@...el.com>,
<adrian.hunter@...el.com>, <reinette.chatre@...el.com>,
<xiaoyao.li@...el.com>, <tony.lindgren@...el.com>,
<binbin.wu@...ux.intel.com>, <dmatlack@...gle.com>,
<isaku.yamahata@...el.com>, <ira.weiny@...el.com>, <vannapurve@...gle.com>,
<david@...hat.com>, <ackerleytng@...gle.com>, <tabba@...gle.com>,
<chao.p.peng@...el.com>
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from
kvm_gmem_populate()

Michael Roth wrote:
> On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
> > On Fri, Jul 11, 2025, Michael Roth wrote:
> > > On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > > > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > > > log:
> > > >
> > > > Problem
> > > > ===
> > > > ...
> > > > (2)
> > > > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > > > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > > > - filemap invalidation lock --> mm->mmap_lock
> > > >
> > > > However, in future code, the shared filemap invalidation lock will be held
> > > > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > > > - mm->mmap_lock --> filemap invalidation lock
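
[ Restating the inversion as code for anyone skimming -- a distilled sketch
  of the two acquisition orders described above, not the literal call
  chains: ]

/* Population path (step 2): invalidation lock first, mmap_lock second. */
static void populate_path(struct address_space *mapping)
{
	filemap_invalidate_lock(mapping);		/* A */
	/* get_user_pages_fast() may fall back to taking mm->mmap_lock (B) */
	filemap_invalidate_unlock(mapping);
}

/* Future shared-fault path: entered with mm->mmap_lock (B) already held. */
static void shared_fault_path(struct address_space *mapping)
{
	filemap_invalidate_lock_shared(mapping);	/* A: inverted vs. above */
	filemap_invalidate_unlock_shared(mapping);
}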
> > >
> > > I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> > > KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
> >
> > Irrespective of shared faults, I think the API could do with a bit of cleanup
> > now that TDX has landed, i.e. now that we can see a bit more of the picture.
> >
> > As is, I'm pretty sure TDX is broken with respect to hugepage support, because
> > kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
>
> Yes, for the THP-based hugepage patchset the preparation-tracking was
> modified so that only the range passed to the post_populate() callback will
> be marked as prepared, so I think that would have addressed this case
> unless there's something more specific to TDX in that regard (otherwise
> it seems analogous to SNP considerations).
>
> However, I think the current leaning here is to drop tracking of
> prepared/unprepared entirely for the in-place conversion / hugepage case. I
> posted an RFC patch that does this in prep for in-place conversion support:
>
> https://lore.kernel.org/kvm/20250613005400.3694904-2-michael.roth@amd.com/
>
> So in that case kvm_arch_gmem_prepare() would always be called via
> kvm_gmem_get_pfn() and the architecture-specific code would handle
> checking whether additional prep is needed or not. (In that context, one
> might even consider removing kvm_arch_gmem_prepare() from gmem entirely and
> treating it as a KVM/hypervisor MMU hook instead, which would be more along
> the lines of TDX, but that's a less-important topic.)
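
[ To make sure I follow: with the prepared-state tracking gone, the shape
  would be roughly the below.  This is only a sketch of the direction, not
  the RFC diff; it is untested and from memory, so the RMP helper signatures
  may be off, and the function name is made up: ]

/* Invoked unconditionally from the kvm_gmem_get_pfn() path via the arch hook. */
static int snp_gmem_prepare_sketch(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
				   int max_order)
{
	bool assigned;
	int level, rc;

	rc = snp_lookup_rmpentry(pfn, &assigned, &level);
	if (rc)
		return rc;

	/* The RMP already says "private": nothing left to do, cheap no-op. */
	if (assigned)
		return 0;

	/* 4K only for now, per the discussion above; max_order is ignored. */
	return rmp_make_private(pfn, gfn_to_gpa(gfn), PG_LEVEL_4K,
				sev_get_asid(kvm), false);
}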
>
>
> > with one page at a time. So that needs to be changed. I assume it's already
> > addressed in one of the many upcoming series, but it still shows a flaw in the API.
> >
> > Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
> > seems pretty straightforward, and would provide consistent ABI for all vendor
> > flavors. E.g. as is, non-struct-page memory will work for SNP, but not TDX. The
> > obvious downside is that struct-page becomes a requirement for SNP, but that
> >
> > The below could be tweaked to batch get_user_pages() into an array of pointers,
> > but given that both SNP and TDX can only operate on one 4KiB page at a time, and
> > that hugepage support doesn't yet exist, trying to super optimize the hugepage
> > case straightaway doesn't seem like a pressing concern.
> >
> > static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> > struct file *file, gfn_t gfn, void __user *src,
> > kvm_gmem_populate_cb post_populate, void *opaque)
> > {
> > pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > struct page *src_page = NULL;
> > bool is_prepared = false;
> > struct folio *folio;
> > int ret, max_order;
> > kvm_pfn_t pfn;
> >
> > if (src) {
> > ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
> > if (ret < 0)
> > return ret;
> > if (ret != 1)
> > return -ENOMEM;
> > }
>
> One tricky part here is that the uAPI currently expects the pages to
> have the private attribute set prior to calling kvm_gmem_populate(),
> which gets enforced below.
>
> For in-place conversion: the idea is that userspace will convert
> private->shared to update in-place, then immediately convert back
> shared->private;

Why convert from private to shared and back to private?  Userspace that
knows about mmap and supports it should create the pages shared, mmap()
them, write the data, then convert them to private.

Old userspace will create the pages private and pass in a source pointer
for the initial data, as it does today.

Internally, the post_populate() callback only needs to know whether the
data is already in place or is coming from somewhere else (i.e. src != NULL).
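
Roughly, something like the below on the TDX side -- just a sketch against
Sean's reworked callback above (struct page for the source), with
tdx_mem_page_add_sketch() standing in for the real TDH.MEM.PAGE.ADD
plumbing:

static int tdx_gmem_post_populate_sketch(struct kvm *kvm, gfn_t gfn,
					 kvm_pfn_t pfn, struct page *src_page,
					 void *opaque)
{
	void *src = NULL;
	int ret;

	/*
	 * src_page == NULL: the payload was written in place (shared mmap()
	 * before conversion), so the gmem page already holds the data.
	 * src_page != NULL: legacy flow, add-and-encrypt from the source.
	 */
	if (src_page)
		src = kmap_local_page(src_page);

	ret = tdx_mem_page_add_sketch(kvm, gfn, pfn, src);	/* hypothetical */

	if (src)
		kunmap_local(src);

	return ret;
}
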
> so that approach would remain compatible with above
> behavior. But if we pass a 'src' parameter to kvm_gmem_populate(),
> and do a GUP or copy_from_user() on it at any point, regardless of whether
> it is outside of filemap_invalidate_lock(), then
> kvm_gmem_fault_shared() will return -EACCES. The only 2 ways I see
> around that are to either a) stop enforcing that pages that get
> processed by kvm_gmem_populate() are private for the in-place conversion
> case, or b) enforce that 'src' is NULL for the in-place conversion case.
>
> I think either would handle the ABBA issue.
>
> One nice thing about a) is that we wouldn't have to change the API for
> SNP_LAUNCH_UPDATE since 'src' could still get passed to
> kvm_gmem_populate(), but it's kind of silly since we are copying the
> pages into themselves in that case. And we still need uAPI updates
> regardless for the expected initial shared/private state of pages passed to
> SNP_LAUNCH_UPDATE, so to me it seems simpler to just never have to deal
> with 'src' anymore outside of the legacy cases (but maybe your change
> still makes sense to have just in terms of making the sequencing of
> locks/etc clearer for the legacy case).
Agreed.
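
FWIW, option b) could be as small as the below -- a sketch only, with the
memslot flag / helper name being a placeholder for "whatever we end up
naming it":

/*
 * A userspace source buffer only makes sense for the legacy flow; for the
 * in-place conversion (mmap-able gmem) case the payload must already be in
 * the gmem page, and touching 'src' would re-introduce the mmap_lock vs.
 * filemap invalidation lock inversion.
 */
static bool kvm_gmem_populate_src_allowed(struct kvm_memory_slot *slot,
					  void __user *src)
{
	if (!src)
		return true;

	return !(slot->flags & KVM_MEMSLOT_SUPPORTS_GMEM_SHARED); /* placeholder flag */
}

__kvm_gmem_populate() would then just return -EINVAL when that fails,
before taking the invalidation lock.
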
Ira
>
> -Mike
>
> >
> > filemap_invalidate_lock(file->f_mapping);
> >
> > if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> > KVM_MEMORY_ATTRIBUTE_PRIVATE,
> > KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> > ret = -EINVAL;
> > goto out_unlock;
> > }
> >
> > folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > if (IS_ERR(folio)) {
> > ret = PTR_ERR(folio);
> > goto out_unlock;
> > }
> >
> > folio_unlock(folio);
> >
> > if (is_prepared) {
> > ret = -EEXIST;
> > goto out_put_folio;
> > }
> >
> > ret = post_populate(kvm, gfn, pfn, src_page, opaque);
> > if (!ret)
> > kvm_gmem_mark_prepared(folio);
> >
> > out_put_folio:
> > folio_put(folio);
> > out_unlock:
> > filemap_invalidate_unlock(file->f_mapping);
> >
> > if (src_page)
> > put_page(src_page);
> > return ret;
> > }
> >
> > long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> > kvm_gmem_populate_cb post_populate, void *opaque)
> > {
> > struct file *file;
> > struct kvm_memory_slot *slot;
> > void __user *p;
> > int ret = 0;
> > long i;
> >
> > lockdep_assert_held(&kvm->slots_lock);
> > if (npages < 0)
> > return -EINVAL;
> >
> > slot = gfn_to_memslot(kvm, start_gfn);
> > if (!kvm_slot_can_be_private(slot))
> > return -EINVAL;
> >
> > file = kvm_gmem_get_file(slot);
> > if (!file)
> > return -EFAULT;
> >
> > npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
> > for (i = 0; i < npages; i ++) {
> > if (signal_pending(current)) {
> > ret = -EINTR;
> > break;
> > }
> >
> > p = src ? src + i * PAGE_SIZE : NULL;
> >
> > ret = __kvm_gmem_populate(kvm, slot, file, start_gfn + i, p,
> > post_populate, opaque);
> > if (ret)
> > break;
> > }
> >
> > fput(file);
> > return ret && !i ? ret : i;
> > }
> >