Message-ID: <aHUmcxuh0a6WfiVr@google.com>
Date: Mon, 14 Jul 2025 08:46:59 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Michael Roth <michael.roth@....com>, pbonzini@...hat.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, rick.p.edgecombe@...el.com, kai.huang@...el.com,
adrian.hunter@...el.com, reinette.chatre@...el.com, xiaoyao.li@...el.com,
tony.lindgren@...el.com, binbin.wu@...ux.intel.com, dmatlack@...gle.com,
isaku.yamahata@...el.com, ira.weiny@...el.com, vannapurve@...gle.com,
david@...hat.com, ackerleytng@...gle.com, tabba@...gle.com,
chao.p.peng@...el.com
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
On Mon, Jul 14, 2025, Yan Zhao wrote:
> On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
> > On Fri, Jul 11, 2025, Michael Roth wrote:
> > > On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > > > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > > > log:
> > > >
> > > > Problem
> > > > ===
> > > > ...
> > > > (2)
> > > > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > > > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > > > - filemap invalidation lock --> mm->mmap_lock
> > > >
> > > > However, in future code, the shared filemap invalidation lock will be held
> > > > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > > > - mm->mmap_lock --> filemap invalidation lock
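
To make the inversion concrete, a rough sketch of the two paths (illustrative
only, not the exact call chains):

	/* (a) tdx_vcpu_init_mem_region() => kvm_gmem_populate(), step 2 above */
	filemap_invalidate_lock(file->f_mapping);
	get_user_pages_fast((unsigned long)src, 1, 0, &src_page);	/* may take mm->mmap_lock */

	/* (b) future kvm_gmem_fault_shared() (see [6]); the fault handler already holds mm->mmap_lock */
	filemap_invalidate_lock_shared(file->f_mapping);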
> > >
> > > I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> > > KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
> >
> > Irrespective of shared faults, I think the API could do with a bit of cleanup
> > now that TDX has landed, i.e. now that we can see a bit more of the picture.
> >
> > As is, I'm pretty sure TDX is broken with respect to hugepage support, because
> > kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
> > with one page at a time. So that needs to be changed. I assume it's already
> In TDX RFC v1, we dealt with multiple pages at a time :)
> https://lore.kernel.org/all/20250424030500.32720-1-yan.y.zhao@intel.com/
>
> > addressed in one of the many upcoming series, but it still shows a flaw in the API.
> >
> > Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
> > seems pretty straightforward, and would provide consistent ABI for all vendor
> > flavors. E.g. as is, non-struct-page memory will work for SNP, but not TDX. The
> > obvious downside is that struct-page becomes a requirement for SNP, but that seems
> > like an acceptable tradeoff.
> >
> > The below could be tweaked to batch get_user_pages() into an array of pointers,
> > but given that both SNP and TDX can only operate on one 4KiB page at a time, and
> > that hugepage support doesn't yet exist, trying to super optimize the hugepage
> > case straightaway doesn't seem like a pressing concern.
>
> > static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> > 				struct file *file, gfn_t gfn, void __user *src,
> > 				kvm_gmem_populate_cb post_populate, void *opaque)
> > {
> > 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > 	struct page *src_page = NULL;
> > 	bool is_prepared = false;
> > 	struct folio *folio;
> > 	int ret, max_order;
> > 	kvm_pfn_t pfn;
> >
> > 	if (src) {
> > 		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
> get_user_pages_fast()?
>
> get_user_pages() can't pass the mmap_assert_locked() assertion.
Oh, I forgot get_user_pages() requires mmap_lock to already be held. I would
prefer to not use a fast variant, so that userspace isn't required to prefault
(and pin?) the source.
So get_user_pages_unlocked()?
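
E.g. something like this (completely untested; note that get_user_pages_unlocked()
takes the pages array before gup_flags):

	ret = get_user_pages_unlocked((unsigned long)src, 1, &src_page, 0);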
>
> > 		if (ret < 0)
> > 			return ret;
> > 		if (ret != 1)
> > 			return -ENOMEM;
> > 	}
> >
> > 	filemap_invalidate_lock(file->f_mapping);
> >
> > 	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> > 					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
> > 					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> if (kvm_mem_is_private(kvm, gfn)) ? where
>
> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> 	struct kvm_memory_slot *slot;
>
> 	if (!IS_ENABLED(CONFIG_KVM_GMEM))
> 		return false;
>
> 	slot = gfn_to_memslot(kvm, gfn);
> 	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot))
> 		return kvm_gmem_is_private(slot, gfn);
>
> 	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> }
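
If we go that route, then presumably the check above collapses to something like
(sketch only):

	if (!kvm_mem_is_private(kvm, gfn)) {
		ret = -EINVAL;
		goto out_unlock;
	}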
>
>
> > 		ret = -EINVAL;
> > 		goto out_unlock;
> > 	}
> >
> > 	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> GFN+1 will return is_prepared == true.
I don't see any reason to try to make the current code truly work with hugepages.
Unless I've misunderstood where we stand, the correctness of hugepage support is
going to depend heavily on how preparedness is implemented. I.e. trying to make
this all work with per-folio granularity just isn't possible, no?