Message-ID: <aTZmY1P6uq8KWwKr@yzhao56-desk.sh.intel.com>
Date: Mon, 8 Dec 2025 13:49:15 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sagi Shahar <sagis@...gle.com>
CC: <pbonzini@...hat.com>, <seanjc@...gle.com>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
	<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>, <kas@...nel.org>,
	<tabba@...gle.com>, <ackerleytng@...gle.com>, <quic_eberman@...cinc.com>,
	<michael.roth@....com>, <david@...hat.com>, <vannapurve@...gle.com>,
	<vbabka@...e.cz>, <thomas.lendacky@....com>, <pgonda@...gle.com>,
	<zhiquan1.li@...el.com>, <fan.du@...el.com>, <jun.miao@...el.com>,
	<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
	<binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>
Subject: Re: [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used
 in split path

On Fri, Dec 05, 2025 at 12:14:46AM -0600, Sagi Shahar wrote:
> On Thu, Aug 7, 2025 at 4:48 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
> >
> > Preallocate a page to be used in the split_external_spt() path.
> >
> > The kernel needs one PAMT page pair for the external_spt and one pair
> > that is provided directly to the TDH.MEM.PAGE.DEMOTE SEAMCALL.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@...ux.intel.com>
> > Co-developed-by: Yan Zhao <yan.y.zhao@...el.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> > ---
> > RFC v2:
> > - Pulled from
> >   git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git tdx/dpamt-huge.
> > - Implemented the flow of topup pamt_page_cache in
> >   tdp_mmu_split_huge_pages_root() (Yan)
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 ++
> >  arch/x86/kvm/mmu/mmu.c          |  1 +
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 51 +++++++++++++++++++++++++++++++++
> >  3 files changed, 54 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 6b6c46c27390..508b133df903 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1591,6 +1591,8 @@ struct kvm_arch {
> >  #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> >         struct kvm_mmu_memory_cache split_desc_cache;
> >
> > +       struct kvm_mmu_memory_cache pamt_page_cache;
> > +
> 
> The latest DPAMT patches use a per-vcpu tdx_prealloc struct to handle
> preallocating pages for pamt. I'm wondering if you've considered how
> this would work here since some of the calls requiring pamt originate
> from user space ioctls and therefore are not associated with a vcpu.
I'll use a per-VM tdx_prealloc struct for splitting here, similar to the
per-VM pamt_page_cache.

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 43dd295b7fd6..91bea25da528 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -48,6 +48,9 @@ struct kvm_tdx {
         * Set/unset is protected with kvm->mmu_lock.
         */
        bool wait_for_sept_zap;
+
+       spinlock_t external_kvm_split_cache_lock;
+       struct tdx_prealloc prealloc_split_cache;
 };

> Since the tdx_prealloc is a per-vcpu struct there are no race issues
> when multiple vcpus need to add pamt pages, but it would be trickier
> here because, theoretically, multiple threads could split different
> pages simultaneously.
A spinlock, external_kvm_split_cache_lock, is introduced to protect the cache
enqueue and dequeue.
(a) When tdp_mmu_split_huge_pages_root() is invoked under write mmu_lock:
    - Since cache dequeue is already under write mmu_lock in
      tdp_mmu_split_huge_page()-->tdx_sept_split_private_spte(), acquiring/
      releasing another spinlock there adds no meaningful contention.
    - Though the cache enqueue in topup_external_split_cache() is not protected
      by mmu_lock, protecting enqueue with a spinlock should not reduce
      concurrency.

(b) When tdp_mmu_split_huge_pages_root() is invoked under read mmu_lock: 
    Introducing a new spinlock may briefly reduce concurrency, which is
    unavoidable here.
    However, there's no known (or foreseeable) use case for multiple threads
    invoking tdp_mmu_split_huge_pages_root() on a mirror root under shared
    mmu_lock.

    For future splitting under shared mmu_lock in the fault path, we'll use
    the per-vCPU tdx_prealloc instead of the per-VM cache. TDX can leverage
    kvm_get_running_vcpu() to differentiate between the two caches.
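
Something like the below (just a sketch for illustration: kvm_get_running_vcpu(),
to_tdx() and to_kvm_tdx() exist today, but the helper name and the per-vCPU
field name are made up here; prealloc_split_cache is the per-VM field from the
diff above):

/*
 * Pick the per-vCPU prealloc when splitting runs in a vCPU context (e.g. the
 * fault path); otherwise fall back to the per-VM split cache used by the
 * ioctl-driven tdp_mmu_split_huge_pages_root() path.
 */
static struct tdx_prealloc *tdx_split_prealloc(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	if (vcpu)
		return &to_tdx(vcpu)->prealloc;		/* per-vCPU (field name assumed) */

	return &to_kvm_tdx(kvm)->prealloc_split_cache;	/* per-VM */
}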


Here's the new diff to the TDP MMU, rebased onto Dynamic PAMT v4.

+/*
+ * Check the per-VM external split cache under write mmu_lock or read mmu_lock
+ * in tdp_mmu_split_huge_pages_root().
+ *
+ * When need_topup_external_split_cache() returns true, the mmu_lock is held
+ * throughout
+ * (a) need_topup_external_split_cache(), and
+ * (b) the cache consumption (in tdx_sept_split_private_spte() called by
+ *     tdp_mmu_split_huge_page()).
+ *
+ * Throughout the execution from (a) to (b):
+ * - With write mmu_lock, the per-VM external split cache is exclusively
+ *   accessed by a single user. Therefore, the result returned from
+ *   need_topup_external_split_cache() is accurate.
+ *
+ * - With read mmu_lock, the per-VM external split cache can be shared among
+ *   multiple users. Cache consumption in tdx_sept_split_private_spte() thus
+ *   needs to re-check the cache page count after acquiring its internal
+ *   split cache lock and return an error if the count is insufficient.
+ */
+static bool need_topup_external_split_cache(struct kvm *kvm, int level)
+{
+       return kvm_x86_call(need_topup_external_per_vm_split_cache)(kvm, level);
+}
+
+static int topup_external_split_cache(struct kvm *kvm, int level, bool shared)
+{
+       int r;
+
+       rcu_read_unlock();
+
+       if (shared)
+               read_unlock(&kvm->mmu_lock);
+       else
+               write_unlock(&kvm->mmu_lock);
+
+       r = kvm_x86_call(topup_external_per_vm_split_cache)(kvm, level);
+
+       if (shared)
+               read_lock(&kvm->mmu_lock);
+       else
+               write_lock(&kvm->mmu_lock);
+
+       if (!r)
+               rcu_read_lock();
+
+       return r;
+}
+
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
                                         struct kvm_mmu_page *root,
                                         gfn_t start, gfn_t end,
@@ -1673,6 +1723,23 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
                        continue;
                }
 
+               if (is_mirror_sp(root) &&
+                   need_topup_external_split_cache(kvm, iter.level)) {
+                       int r;
+
+                       r = topup_external_split_cache(kvm, iter.level, shared);
+
+                       if (r) {
+                               trace_kvm_mmu_split_huge_page(iter.gfn,
+                                                             iter.old_spte,
+                                                             iter.level, r);
+                               return r;
+                       }
+
+                       iter.yielded = true;
+                       continue;
+               }
+
                tdp_mmu_init_child_sp(sp, &iter);
 
                if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
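

(Not part of the diff above.) For reference, a rough sketch of how the two
per-VM hooks could look on the TDX side, keeping the enqueue under
external_kvm_split_cache_lock as described in (a)/(b); the tdx_prealloc field
names (cnt, page_list) and the per-level page estimate are assumptions:

/*
 * Pages a single split at 'level' may consume: one PAMT page pair for the
 * new external_spt plus one pair for TDH.MEM.PAGE.DEMOTE (per the patch
 * description); the exact count per level is an assumption here.
 */
static int tdx_split_cache_min_pages(int level)
{
	return 2 + 2;
}

static bool tdx_need_topup_external_per_vm_split_cache(struct kvm *kvm, int level)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	bool need;

	spin_lock(&kvm_tdx->external_kvm_split_cache_lock);
	need = kvm_tdx->prealloc_split_cache.cnt < tdx_split_cache_min_pages(level);
	spin_unlock(&kvm_tdx->external_kvm_split_cache_lock);

	return need;
}

static int tdx_topup_external_per_vm_split_cache(struct kvm *kvm, int level)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	int min = tdx_split_cache_min_pages(level);

	for (;;) {
		struct page *page;

		spin_lock(&kvm_tdx->external_kvm_split_cache_lock);
		if (kvm_tdx->prealloc_split_cache.cnt >= min) {
			spin_unlock(&kvm_tdx->external_kvm_split_cache_lock);
			return 0;
		}
		spin_unlock(&kvm_tdx->external_kvm_split_cache_lock);

		/* Allocate outside both mmu_lock and the spinlock. */
		page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
		if (!page)
			return -ENOMEM;

		spin_lock(&kvm_tdx->external_kvm_split_cache_lock);
		list_add(&page->lru, &kvm_tdx->prealloc_split_cache.page_list);
		kvm_tdx->prealloc_split_cache.cnt++;
		spin_unlock(&kvm_tdx->external_kvm_split_cache_lock);
	}
}

The consumer in tdx_sept_split_private_spte() would take the same lock,
re-check cnt, and fail the split if a concurrent user under shared mmu_lock
drained the cache, matching the comment in the diff above.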



