linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aE/q9VKkmaCcuwpU@yzhao56-desk.sh.intel.com>
Date: Mon, 16 Jun 2025 17:59:17 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: Ackerley Tng <ackerleytng@...gle.com>, <pbonzini@...hat.com>,
	<seanjc@...gle.com>, <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
	<x86@...nel.org>, <rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>,
	<kirill.shutemov@...el.com>, <tabba@...gle.com>, <quic_eberman@...cinc.com>,
	<michael.roth@....com>, <david@...hat.com>, <vbabka@...e.cz>,
	<jroedel@...e.de>, <thomas.lendacky@....com>, <pgonda@...gle.com>,
	<zhiquan1.li@...el.com>, <fan.du@...el.com>, <jun.miao@...el.com>,
	<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
	<binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
 pages

On Wed, Jun 11, 2025 at 07:30:10AM -0700, Vishal Annapurve wrote:
> On Wed, Jun 4, 2025 at 7:45 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >
> > We need to restore to the previous status (which includes the host page table)
> > if conversion can't be done.
> > That said, in my view, a better flow would be:
> >
> > 1. guest_memfd sends a pre-invalidation request to users (users here means the
> >    consumers in kernel of memory allocated from guest_memfd).
> >
> > 2. Users (A, B, ..., X) perform pre-checks to determine if invalidation can
> >    proceed. For example, in the case of TDX, this might involve memory
> >    allocation and page splitting.
> >
> > 3. Based on the pre-check results, guest_memfd either aborts the invalidation or
> >    proceeds by sending the actual invalidation request.
> >
> > 4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail. For
> >    TDX, the unmap must succeed unless there are bugs in the KVM or TDX module.
> >    In such cases, TDX can callback guest_memfd to inform the poison-status of
> >    the page or elevate the page reference count.
> 
> Few questions here:
> 1) It sounds like the failure to remove entries from SEPT could only
> be due to bugs in the KVM/TDX module,
Yes.

> how reliable would it be to
> continue executing TDX VMs on the host once such bugs are hit?
The TDX VMs will be killed. However, the private pages are still mapped in the
SEPT (after the unmapping failure).
The teardown flow for TDX VM is:

do_exit
  |->exit_files
     |->kvm_gmem_release ==> (1) Unmap guest pages 
     |->release kvmfd
        |->kvm_destroy_vm  (2) Reclaiming resources
           |->kvm_arch_pre_destroy_vm  ==> Release hkid
           |->kvm_arch_destroy_vm  ==> Reclaim SEPT page table pages

Without holding page reference after (1) fails, the guest pages may have been
re-assigned by the host OS while they are still still tracked in the TDX module.


> 2) Is it reliable to continue executing the host kernel and other
> normal VMs once such bugs are hit?
If with TDX holding the page ref count, the impact of unmapping failure of guest
pages is just to leak those pages.

> 3) Can the memory be reclaimed reliably if the VM is marked as dead
> and cleaned up right away?
As in the above flow, TDX needs to hold the page reference on unmapping failure
until after reclaiming is successful. Well, reclaiming itself is possible to
fail either.

So, below is my proposal. Showed in the simple POC code based on
https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2.

Patch 1: TDX increases page ref count on unmap failure.
Patch 2: Bail out private-to-shared conversion if splitting fails.
Patch 3: Make kvm_gmem_zap() return void.

After the change,
- the actual private-to-shared conversion will be not be executed on splitting
  failure (which could be due to out of memory or bugs in the KVM/TDX module) or
  unmapping failure (which is due to bugs in the KVM/TDX module).
- other callers of kvm_gmem_zap(), such as kvm_gmem_release(),
  kvm_gmem_error_folio(), kvm_gmem_punch_hole(), are still allowed to proceed.
  After truncating the pages out of the filemap, the pages could be leaked on
  purpose with the reference hold by TDX.


commit 50432c0bb1e10591714b6b880f43fc30797ca047
Author: Yan Zhao <yan.y.zhao@...el.com>
Date:   Tue Jun 10 00:02:30 2025 -0700

    KVM: TDX: Hold folio ref count on fatal error

    Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c60c1fa7b4ee..93c31eecfc60 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1502,6 +1502,15 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
        td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }

+/*
+ * Called when fatal error occurs during removing a TD's page.
+ * Increase the folio ref count in case it's reused by other VMs or host.
+ */
+static void tdx_hold_page_on_error(struct kvm *kvm, struct page *page, int level)
+{
+       folio_ref_add(page_folio(page), KVM_PAGES_PER_HPAGE(level));
+}
+
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
                            enum pg_level level, struct page *page)
 {
@@ -1868,12 +1877,14 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
         * before any might be populated. Warn if zapping is attempted when
         * there can't be anything populated in the private EPT.
         */
-       if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-               return -EINVAL;
+       if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm)) {
+               ret = -EINVAL;
+               goto fatal_error;
+       }

        ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
        if (ret <= 0)
-               return ret;
+               goto fatal_error;

        /*
         * TDX requires TLB tracking before dropping private page.  Do
@@ -1881,7 +1892,14 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
         */
        tdx_track(kvm);

-       return tdx_sept_drop_private_spte(kvm, gfn, level, page);
+       ret = tdx_sept_drop_private_spte(kvm, gfn, level, page);
+       if (ret)
+               goto fatal_error;
+       return ret;
+fatal_error:
+       if (ret < 0)
+               tdx_hold_page_on_error(kvm, page, level);
+       return ret;
 }

 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,


commit 240acb13d4bd724b4c153b73cfba3cd14d3cc296
Author: Yan Zhao <yan.y.zhao@...el.com>
Date:   Tue Jun 10 19:26:56 2025 -0700

    KVM: guest_memfd: Add check kvm_gmem_private_has_safe_refcount()

    Check extra ref count on private pages in case of TDX unmap failure before
    private to shared conversion in the backend.

    In other zap cases, it's ok to do without this ref count check so that
    the error folio will be held by TDX after guest_memfd releases the folio.

    Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index af7943c0a8ba..1e1312bfa157 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -521,6 +521,41 @@ static void kvm_gmem_convert_invalidate_end(struct inode *inode,
                kvm_gmem_invalidate_end(gmem, invalidate_start, invalidate_end);
 }

+static bool kvm_gmem_private_has_safe_refcount(struct inode *inode,
+                                              pgoff_t start, pgoff_t end)
+{
+       pgoff_t index = start;
+       size_t inode_nr_pages;
+       bool ret = true;
+       void *priv;
+
+       /*
+        * Conversion in !kvm_gmem_has_custom_allocator() case does not reach here.
+        */
+       if (!kvm_gmem_has_custom_allocator(inode))
+               return ret;
+
+       priv = kvm_gmem_allocator_private(inode);
+       inode_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+
+       while (index < end) {
+               struct folio *f;
+               f = filemap_get_folio(inode->i_mapping, index);
+               if (IS_ERR(f)) {
+                       index += inode_nr_pages;
+                       continue;
+               }
+
+               folio_put(f);
+               if (folio_ref_count(f) > folio_nr_pages(f)) {
+                       ret = false;
+                       break;
+               }
+               index += folio_nr_pages(f);
+       }
+       return ret;
+}
+
 static int kvm_gmem_convert_should_proceed(struct inode *inode,
                                           struct conversion_work *work,
                                           bool to_shared, pgoff_t *error_index)
@@ -538,6 +573,10 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
                list_for_each_entry(gmem, gmem_list, entry) {
                        ret = kvm_gmem_zap(gmem, work->start, work_end,
                                           KVM_FILTER_PRIVATE, true);
+                       if (ret)
+                               return ret;
+                       if (!kvm_gmem_private_has_safe_refcount(inode, work->start, work_end))
+                               return -EFAULT;
                }
        } else {
                unmap_mapping_pages(inode->i_mapping, work->start,


commit 26743993663313fa6f8741a43f22ed5ac21399c7
Author: Yan Zhao <yan.y.zhao@...el.com>
Date:   Tue Jun 10 20:01:23 2025 -0700

    KVM: guest_memfd: Move splitting KVM mappings out of kvm_gmem_zap()

    Modify kvm_gmem_zap() to return void and introduce a separate function,
    kvm_gmem_split_private(), to handle the splitting of private EPT.

    With these changes, kvm_gmem_zap() will only be executed after successful
    splitting across the entire conversion/punch range.

    Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 1e1312bfa157..e81efcef0837 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -318,8 +318,7 @@ static bool kvm_gmem_has_safe_refcount(struct address_space *mapping, pgoff_t st
        return refcount_safe;
 }

-static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
-                       enum kvm_gfn_range_filter filter, bool do_split)
+static int kvm_gmem_split_private(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end)
 {
        struct kvm_memory_slot *slot;
        struct kvm *kvm = gmem->kvm;
@@ -336,7 +335,7 @@ static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
                        .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
                        .slot = slot,
                        .may_block = true,
-                       .attr_filter = filter,
+                       .attr_filter = KVM_FILTER_PRIVATE,
                };

                if (!locked) {
@@ -344,16 +343,13 @@ static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
                        locked = true;
                }

-               if (do_split) {
-                       ret = kvm_split_boundary_leafs(kvm, &gfn_range);
-                       if (ret < 0)
-                               goto out;
+               ret = kvm_split_boundary_leafs(kvm, &gfn_range);
+               if (ret < 0)
+                       goto out;

-                       flush |= ret;
-                       ret = 0;
-               }
+               flush |= ret;
+               ret = 0;

-               flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
        }
 out:
        if (flush)
@@ -365,6 +361,42 @@ static int kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
        return ret;
 }

+static void kvm_gmem_zap(struct kvm_gmem *gmem, pgoff_t start, pgoff_t end,
+                       enum kvm_gfn_range_filter filter)
+{
+       struct kvm_memory_slot *slot;
+       struct kvm *kvm = gmem->kvm;
+       unsigned long index;
+       bool locked = false;
+       bool flush = false;
+
+       xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+               pgoff_t pgoff = slot->gmem.pgoff;
+               struct kvm_gfn_range gfn_range = {
+                       .start = slot->base_gfn + max(pgoff, start) - pgoff,
+                       .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+                       .slot = slot,
+                       .may_block = true,
+                       .attr_filter = filter,
+               };
+
+               if (!locked) {
+                       KVM_MMU_LOCK(kvm);
+                       locked = true;
+               }
+
+               flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+       }
+
+       if (flush)
+               kvm_flush_remote_tlbs(kvm);
+
+       if (locked)
+               KVM_MMU_UNLOCK(kvm);
+
+       return;
+}
+
 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
                                      pgoff_t end)
 {
@@ -571,10 +603,10 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,

                gmem_list = &inode->i_mapping->i_private_list;
                list_for_each_entry(gmem, gmem_list, entry) {
-                       ret = kvm_gmem_zap(gmem, work->start, work_end,
-                                          KVM_FILTER_PRIVATE, true);
+                       ret = kvm_gmem_split_private(gmem, work->start, work_end);
                        if (ret)
                                return ret;
+                       kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
                        if (!kvm_gmem_private_has_safe_refcount(inode, work->start, work_end))
                                return -EFAULT;
                }
@@ -1471,9 +1503,10 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
                 * expensive.
                 */
                filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
-               ret = kvm_gmem_zap(gmem, start, end, filter, true);
+               ret = kvm_gmem_split_private(gmem, start, end);
                if (ret)
                        goto out;
+               kvm_gmem_zap(gmem, start, end, filter);
        }

        if (kvm_gmem_has_custom_allocator(inode)) {
@@ -1606,7 +1639,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
         * memory, as its lifetime is associated with the inode, not the file.
         */
        kvm_gmem_invalidate_begin(gmem, 0, -1ul);
-       kvm_gmem_zap(gmem, 0, -1ul, KVM_FILTER_PRIVATE | KVM_FILTER_SHARED, false);
+       kvm_gmem_zap(gmem, 0, -1ul, KVM_FILTER_PRIVATE | KVM_FILTER_SHARED);
        kvm_gmem_invalidate_end(gmem, 0, -1ul);

        list_del(&gmem->entry);
@@ -1942,7 +1975,7 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol

                kvm_gmem_invalidate_begin(gmem, start, end);
                filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
-               kvm_gmem_zap(gmem, start, end, filter, false);
+               kvm_gmem_zap(gmem, start, end, filter);
        }

        /*


If the above changes are agreeable, we could consider a more ambitious approach:
introducing an interface like:

int guest_memfd_add_page_ref_count(gfn_t gfn, int nr);
int guest_memfd_dec_page_ref_count(gfn_t gfn, int nr);

This would allow guest_memfd to maintain an internal reference count for each
private GFN. TDX would call guest_memfd_add_page_ref_count() for mapping and
guest_memfd_dec_page_ref_count() after a successful unmapping. Before truncating
a private page from the filemap, guest_memfd could increase the real folio
reference count based on its internal reference count for the private GFN.