Message-ID: <aHSgdEJpY/JF+a1f@yzhao56-desk>
Date: Mon, 14 Jul 2025 14:15:16 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Michael Roth <michael.roth@....com>, <pbonzini@...hat.com>,
<kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<rick.p.edgecombe@...el.com>, <kai.huang@...el.com>,
<adrian.hunter@...el.com>, <reinette.chatre@...el.com>,
<xiaoyao.li@...el.com>, <tony.lindgren@...el.com>,
<binbin.wu@...ux.intel.com>, <dmatlack@...gle.com>,
<isaku.yamahata@...el.com>, <ira.weiny@...el.com>, <vannapurve@...gle.com>,
<david@...hat.com>, <ackerleytng@...gle.com>, <tabba@...gle.com>,
<chao.p.peng@...el.com>
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from
kvm_gmem_populate()
On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
> On Fri, Jul 11, 2025, Michael Roth wrote:
> > On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > > log:
> > >
> > > Problem
> > > ===
> > > ...
> > > (2)
> > > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > > - filemap invalidation lock --> mm->mmap_lock
> > >
> > > However, in future code, the shared filemap invalidation lock will be held
> > > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > > - mm->mmap_lock --> filemap invalidation lock
> >
> > I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> > KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
>
> Irrespective of shared faults, I think the API could do with a bit of cleanup
> now that TDX has landed, i.e. now that we can see a bit more of the picture.
>
> As is, I'm pretty sure TDX is broken with respect to hugepage support, because
> kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
> with one page at a time. So that needs to be changed. I assume it's already
In TDX RFC v1, we deal with multiple pages at a time :)
https://lore.kernel.org/all/20250424030500.32720-1-yan.y.zhao@intel.com/
> addressed in one of the many upcoming series, but it still shows a flaw in the API.
>
> Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
> seems pretty straightforward, and would provide consistent ABI for all vendor
> flavors. E.g. as is, non-struct-page memory will work for SNP, but not TDX. The
> obvious downside is that struct-page becomes a requirement for SNP, but that
>
> The below could be tweaked to batch get_user_pages() into an array of pointers,
> but given that both SNP and TDX can only operate on one 4KiB page at a time, and
> that hugepage support doesn't yet exist, trying to super optimize the hugepage
> case straightaway doesn't seem like a pressing concern.
>
> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> 				struct file *file, gfn_t gfn, void __user *src,
> 				kvm_gmem_populate_cb post_populate, void *opaque)
> {
> 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> 	struct page *src_page = NULL;
> 	bool is_prepared = false;
> 	struct folio *folio;
> 	int ret, max_order;
> 	kvm_pfn_t pfn;
>
> 	if (src) {
> 		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
Use get_user_pages_fast() instead?
get_user_pages() requires the caller to hold mm->mmap_lock, so calling it here
would trip mmap_assert_locked().
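
To illustrate the difference, here is a userspace toy model, not kernel code.
All toy_* names are made up for this sketch, and toy_gup() returns -1 at the
point where the kernel would hit mmap_assert_locked(). It assumes the usual
contract: get_user_pages() expects the caller to hold mm->mmap_lock, while
get_user_pages_fast() can be called without it and may take the lock itself
when it falls back to the slow path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for mm->mmap_lock; it only records whether the lock is held. */
struct toy_mm {
	bool mmap_locked;
	int lock_acquisitions;
};

static void toy_mmap_lock(struct toy_mm *mm)
{
	assert(!mm->mmap_locked);
	mm->mmap_locked = true;
	mm->lock_acquisitions++;
}

static void toy_mmap_unlock(struct toy_mm *mm)
{
	assert(mm->mmap_locked);
	mm->mmap_locked = false;
}

/*
 * Models get_user_pages(): the caller must hold the lock.  The kernel would
 * trip mmap_assert_locked(); the toy returns -1 so the failure is observable.
 */
static int toy_gup(struct toy_mm *mm)
{
	if (!mm->mmap_locked)
		return -1;
	return 1;	/* one page "pinned" */
}

/*
 * Models get_user_pages_fast(): callable without the lock.  When the
 * lockless walk does not succeed, it takes the lock itself and falls back.
 */
static int toy_gup_fast(struct toy_mm *mm, bool fast_path_ok)
{
	int ret;

	if (fast_path_ok)
		return 1;

	toy_mmap_lock(mm);
	ret = toy_gup(mm);
	toy_mmap_unlock(mm);
	return ret;
}
```

In this model, calling toy_gup() with the lock not held fails, while
toy_gup_fast() succeeds either way and only touches the lock on the fallback
path.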
> 		if (ret < 0)
> 			return ret;
> 		if (ret != 1)
> 			return -ENOMEM;
> 	}
>
> 	filemap_invalidate_lock(file->f_mapping);
>
> 	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> 					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
> 					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
Use
	if (!kvm_mem_is_private(kvm, gfn)) {
here instead? kvm_mem_is_private() is defined as:

static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_memory_slot *slot;

	if (!IS_ENABLED(CONFIG_KVM_GMEM))
		return false;

	slot = gfn_to_memslot(kvm, gfn);
	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot))
		return kvm_gmem_is_private(slot, gfn);

	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
}
> 		ret = -EINVAL;
> 		goto out_unlock;
> 	}
>
> 	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
If max_order > 0 is returned, the folio covers more than one page. Since the
first invocation marks the whole folio as prepared, the next invocation of
__kvm_gmem_populate() for GFN+1 will return is_prepared == true.
> 	if (IS_ERR(folio)) {
> 		ret = PTR_ERR(folio);
> 		goto out_unlock;
> 	}
>
> 	folio_unlock(folio);
>
> 	if (is_prepared) {
> 		ret = -EEXIST;
> 		goto out_put_folio;
> 	}
So, skip this is_prepared check?
>
> 	ret = post_populate(kvm, gfn, pfn, src_page, opaque);
Pass in the slot to post_populate() as well?
TDX may need to invoke hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M)
in tdx_gmem_post_populate() if kvm_tdp_mmu_map_private_pfn() does not check
the hook private_max_mapping_level for max_level as in
https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com.
> 	if (!ret)
> 		kvm_gmem_mark_prepared(folio);
	if (!ret && !is_prepared)
		kvm_gmem_mark_prepared(folio);
?
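
A userspace toy model of the concern (not kernel code; struct toy_folio and
the populate_* helpers are hypothetical names for this sketch). It assumes a
single prepared flag per folio, as described above, and compares the quoted
flow, which fails with -EEXIST once the folio is prepared, against the
variant that skips the check and only marks the folio once.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* One "prepared" flag covers all (1 << order) pages of the folio. */
struct toy_folio {
	unsigned int order;
	bool prepared;
	long pages_populated;
};

/* Quoted flow: bail out if the folio is already prepared. */
static int populate_strict(struct toy_folio *f)
{
	if (f->prepared)
		return -EEXIST;

	f->pages_populated++;
	f->prepared = true;	/* marks the entire folio, not just one page */
	return 0;
}

/* Suggested variant: no -EEXIST, mark the folio only if not yet marked. */
static int populate_relaxed(struct toy_folio *f)
{
	bool is_prepared = f->prepared;

	f->pages_populated++;
	if (!is_prepared)
		f->prepared = true;
	return 0;
}

/* Page-at-a-time loop with the same return convention as the quoted code. */
static long populate_range(struct toy_folio *f,
			   int (*populate)(struct toy_folio *))
{
	long npages = 1L << f->order;
	int ret = 0;
	long i;

	for (i = 0; i < npages; i++) {
		ret = populate(f);
		if (ret)
			break;
	}

	return ret && !i ? ret : i;
}
```

With an order-9 folio, the strict flow stops after the first page because the
second call already sees the folio as prepared, while the relaxed flow gets
through all 512 pages.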
> out_put_folio:
> 	folio_put(folio);
> out_unlock:
> 	filemap_invalidate_unlock(file->f_mapping);
>
> 	if (src_page)
> 		put_page(src_page);
> 	return ret;
> }
>
> long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> 		       kvm_gmem_populate_cb post_populate, void *opaque)
> {
> 	struct file *file;
> 	struct kvm_memory_slot *slot;
> 	void __user *p;
> 	int ret = 0;
> 	long i;
>
> 	lockdep_assert_held(&kvm->slots_lock);
> 	if (npages < 0)
> 		return -EINVAL;
>
> 	slot = gfn_to_memslot(kvm, start_gfn);
> 	if (!kvm_slot_can_be_private(slot))
> 		return -EINVAL;
>
> 	file = kvm_gmem_get_file(slot);
> 	if (!file)
> 		return -EFAULT;
>
> 	npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
> 	for (i = 0; i < npages; i++) {
> 		if (signal_pending(current)) {
> 			ret = -EINTR;
> 			break;
> 		}
>
> 		p = src ? src + i * PAGE_SIZE : NULL;
>
> 		ret = __kvm_gmem_populate(kvm, slot, file, start_gfn + i, p,
> 					  post_populate, opaque);
> 		if (ret)
> 			break;
> 	}
>
> 	fput(file);
> 	return ret && !i ? ret : i;
> }