Message-ID: <aHUmcxuh0a6WfiVr@google.com>
Date: Mon, 14 Jul 2025 08:46:59 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Michael Roth <michael.roth@....com>, pbonzini@...hat.com, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org, rick.p.edgecombe@...el.com, kai.huang@...el.com, 
	adrian.hunter@...el.com, reinette.chatre@...el.com, xiaoyao.li@...el.com, 
	tony.lindgren@...el.com, binbin.wu@...ux.intel.com, dmatlack@...gle.com, 
	isaku.yamahata@...el.com, ira.weiny@...el.com, vannapurve@...gle.com, 
	david@...hat.com, ackerleytng@...gle.com, tabba@...gle.com, 
	chao.p.peng@...el.com
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()

On Mon, Jul 14, 2025, Yan Zhao wrote:
> On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
> > On Fri, Jul 11, 2025, Michael Roth wrote:
> > > On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > > > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > > > log:
> > > > 
> > > > Problem
> > > > ===
> > > > ...
> > > > (2)
> > > > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > > > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > > > - filemap invalidation lock --> mm->mmap_lock
> > > > 
> > > > However, in future code, the shared filemap invalidation lock will be held
> > > > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > > > - mm->mmap_lock --> filemap invalidation lock
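
[Schematically, the inversion being described (the lock APIs are the real
kernel functions; the interleaving shown is an illustrative sketch, not
literal code from the patch):

	/* Path 1: tdx_vcpu_init_mem_region() via kvm_gmem_populate() */
	filemap_invalidate_lock(file->f_mapping);
	/* GUP's slow path takes mm->mmap_lock:
	 * invalidate_lock -> mmap_lock */
	get_user_pages_fast((unsigned long)src, 1, 0, &src_page);

	/* Path 2: future kvm_gmem_fault_shared(), reached from the page
	 * fault handler, which already holds mm->mmap_lock:
	 * mmap_lock -> invalidate_lock */
	filemap_invalidate_lock_shared(file->f_mapping);

Path 1 orders invalidate_lock -> mmap_lock while path 2 orders
mmap_lock -> invalidate_lock, hence the potential AB-BA deadlock.]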
> > > 
> > > I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> > > KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
> > 
> > Irrespective of shared faults, I think the API could do with a bit of cleanup
> > now that TDX has landed, i.e. now that we can see a bit more of the picture.
> > 
> > As is, I'm pretty sure TDX is broken with respect to hugepage support, because
> > kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
> > with one page at a time.  So that needs to be changed.  I assume it's already
> In TDX RFC v1, we deal with multiple pages at a time :)
> https://lore.kernel.org/all/20250424030500.32720-1-yan.y.zhao@intel.com/
> 
> > addressed in one of the many upcoming series, but it still shows a flaw in the API.
> > 
> > Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
> > seems pretty straightforward, and would provide consistent ABI for all vendor
> > flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.  The
> > obvious downside is that struct-page becomes a requirement for SNP, but that
> > seems like an acceptable tradeoff.
> > 
> > The below could be tweaked to batch get_user_pages() into an array of pointers,
> > but given that both SNP and TDX can only operate on one 4KiB page at a time, and
> > that hugepage support doesn't yet exist, trying to super optimize the hugepage
> > case straightaway doesn't seem like a pressing concern.
> 
> > static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> > 				struct file *file, gfn_t gfn, void __user *src,
> > 				kvm_gmem_populate_cb post_populate, void *opaque)
> > {
> > 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > 	struct page *src_page = NULL;
> > 	bool is_prepared = false;
> > 	struct folio *folio;
> > 	int ret, max_order;
> > 	kvm_pfn_t pfn;
> > 
> > 	if (src) {
> > 		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
> get_user_pages_fast()?
> 
> get_user_pages() can't pass the assertion of mmap_assert_locked().

Oh, I forgot get_user_pages() requires mmap_lock to already be held.  I would
prefer not to use a fast variant, so that userspace isn't required to prefault
(and pin?) the source.

So get_user_pages_unlocked()?
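
[For reference, a minimal sketch of the locking contracts of the three GUP
variants in play; the wrapper function below is hypothetical, but the GUP
calls and signatures are the real kernel APIs:

static int pin_source_page(void __user *src, struct page **src_page)
{
	long ret;

	/*
	 * get_user_pages() asserts that the caller already holds
	 * current->mm->mmap_lock, so it would have to be wrapped:
	 *
	 *	mmap_read_lock(current->mm);
	 *	ret = get_user_pages((unsigned long)src, 1, 0, src_page);
	 *	mmap_read_unlock(current->mm);
	 *
	 * get_user_pages_fast() first attempts a lockless page table walk
	 * and only falls back to taking mmap_lock itself, so it is cheap
	 * when the source is already faulted in.
	 *
	 * get_user_pages_unlocked() takes and drops mmap_lock internally
	 * and faults the page in as needed, so userspace doesn't have to
	 * prefault (or pin) the source:
	 */
	ret = get_user_pages_unlocked((unsigned long)src, 1, src_page, 0);
	if (ret < 0)
		return ret;

	return ret == 1 ? 0 : -ENOMEM;
}]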

> 
> > 		if (ret < 0)
> > 			return ret;
> > 		if (ret != 1)
> > 			return -ENOMEM;
> > 	}
> > 
> > 	filemap_invalidate_lock(file->f_mapping);
> > 
> > 	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> > 					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
> > 					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> if (kvm_mem_is_private(kvm, gfn)) ? where
> 
> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
>         struct kvm_memory_slot *slot;
> 
>         if (!IS_ENABLED(CONFIG_KVM_GMEM))
>                 return false;
> 
>         slot = gfn_to_memslot(kvm, gfn);
>         if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot))
>                 return kvm_gmem_is_private(slot, gfn);
> 
>         return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> }
> 
> 
> > 		ret = -EINVAL;
> > 		goto out_unlock;
> > 	}
> > 
> > 	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> GFN+1 will return is_prepared == true.

I don't see any reason to try to make the current code truly work with hugepages.
Unless I've misunderstood where we stand, the correctness of hugepage support is
going to depend heavily on the implementation for preparedness.  I.e. trying to
make this all work with per-folio granularity just isn't possible, no?
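
[To make the per-folio problem concrete, a sketch of two back-to-back
populate calls; this is illustrative, not the literal gmem code, and it
assumes the existing kvm_gmem_mark_prepared() helper, which marks the whole
folio uptodate:

	/* Call 1: GFN hits page 0 of a 2MiB (512-page) folio. */
	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared,
				   &max_order);
	/* is_prepared == false, so post_populate() copies one 4KiB page,
	 * but then all 512 pages get marked prepared: */
	kvm_gmem_mark_prepared(folio);

	/* Call 2: GFN + 1 hits page 1 of the same folio. */
	folio = __kvm_gmem_get_pfn(file, slot, index + 1, &pfn, &is_prepared,
				   &max_order);
	/* is_prepared == true now, so page 1 is skipped and never gets
	 * populated, even though only page 0 was actually copied. */]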
