Message-ID: <aW2XfpmV7FqO2HpA@yzhao56-desk.sh.intel.com>
Date: Mon, 19 Jan 2026 10:31:26 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Rick Edgecombe <rick.p.edgecombe@...el.com>, <bp@...en8.de>,
<chao.gao@...el.com>, <dave.hansen@...el.com>, <isaku.yamahata@...el.com>,
<kai.huang@...el.com>, <kas@...nel.org>, <kvm@...r.kernel.org>,
<linux-coco@...ts.linux.dev>, <linux-kernel@...r.kernel.org>,
<mingo@...hat.com>, <pbonzini@...hat.com>, <tglx@...utronix.de>,
<vannapurve@...gle.com>, <x86@...nel.org>, <xiaoyao.li@...el.com>,
<binbin.wu@...el.com>
Subject: Re: [PATCH v4 11/16] KVM: TDX: Add x86 ops for external spt cache
On Fri, Jan 16, 2026 at 04:53:57PM -0800, Sean Christopherson wrote:
> On Thu, Nov 20, 2025, Rick Edgecombe wrote:
> > Move mmu_external_spt_cache behind x86 ops.
> >
> > In the mirror/external MMU concept, the KVM MMU manages a non-active EPT
> > tree for private memory (the mirror). The actual active EPT tree that
> > protects the private memory lives inside the TDX module. Whenever the
> > mirror EPT is changed, it needs to call out into one of a set of x86 ops
> > that implement various update operations with TDX specific SEAMCALLs and
> > other tricks. These implementations operate on the TDX S-EPT (the
> > external).
> >
> > In reality these external operations are designed narrowly with respect to
> > TDX particulars. On the surface, what TDX specific things are happening to
> > fulfill these update operations are mostly hidden from the MMU, but there
> > is one particular area of interest where some details leak through.
> >
> > The S-EPT needs pages to use for the S-EPT page tables. These page tables
> > need to be allocated before taking the mmu lock, like all the rest. So the
> > KVM MMU pre-allocates pages for TDX to use for the S-EPT in the same place
> > where it pre-allocates the other page tables. It’s not too bad and fits
> > nicely with the others.
> >
> > However, Dynamic PAMT will need even more pages for the same operations.
> > Further, these pages will need to be handed to the arch/x86 side, which
> > uses them for DPAMT updates; that is hard to do with the existing KVM
> > based cache. The details living in core MMU code start to add up.
> >
> > So in preparation to make it more complicated, move the external page
> > table cache into TDX code by putting it behind some x86 ops. Have one for
> > topping up and one for allocation. Don’t go so far as to try to hide the
> > existence of external page tables completely from the generic MMU, as they
> > are currently stored in their mirror struct kvm_mmu_page and it’s quite
> > handy.
> >
> > To plumb the memory cache operations through tdx.c, export some of
> > the functions temporarily. This will be removed in future changes.
> >
> > Acked-by: Kiryl Shutsemau <kas@...nel.org>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> > ---
>
> NAK. I kinda sorta get why you did this? But the pages KVM uses for page tables
> are KVM's, not to be mixed with PAMT pages.
>
> Eww. Definitely a hard "no". In tdp_mmu_alloc_sp_for_split(), the allocation
> comes from KVM:
>
> 	if (mirror) {
> 		sp->external_spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> 		if (!sp->external_spt) {
> 			free_page((unsigned long)sp->spt);
> 			kmem_cache_free(mmu_page_header_cache, sp);
> 			return NULL;
> 		}
> 	}
>
> But then in kvm_tdp_mmu_map(), via kvm_mmu_alloc_external_spt(), the allocation
> comes from get_tdx_prealloc_page()
>
> static void *tdx_alloc_external_fault_cache(struct kvm_vcpu *vcpu)
> {
> 	struct page *page = get_tdx_prealloc_page(&to_tdx(vcpu)->prealloc);
>
> 	if (WARN_ON_ONCE(!page))
> 		return (void *)__get_free_page(GFP_ATOMIC | __GFP_ACCOUNT);
>
> 	return page_address(page);
> }
>
> But then regardless of where the page came from, KVM frees it. Seriously.
>
> static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> {
> 	free_page((unsigned long)sp->external_spt);    <=====
> 	free_page((unsigned long)sp->spt);
> 	kmem_cache_free(mmu_page_header_cache, sp);
> }
IMHO, it's by design. I don't see a problem with KVM freeing the sp->external_spt,
regardless of whether it's from:
(1) KVM's mmu cache,
(2) tdp_mmu_alloc_sp_for_split(), or
(3) tdx_alloc_external_fault_cache().
Please correct me if I missed anything.
None of (1)-(3) keeps the pages on a list after KVM obtains them and links
them into SPTEs.
So, with SPTEs as the pages' sole consumer, it's perfectly fine for KVM to free
the pages when freeing SPTEs. No?
Also, in the current upstream code, after tdp_mmu_split_huge_pages_root() is
invoked for dirty tracking, some sp->spt are allocated from
tdp_mmu_alloc_sp_for_split(), while others are from kvm_mmu_memory_cache_alloc().
However, tdp_mmu_free_sp() can still free them without any problem.
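To make the point concrete, below is a rough sketch of the dequeue side as I
understand it (names are assumed from the snippets above, not copied from the
series): once a page is handed out it is dropped from the prealloc list, so
sp->external_spt ends up holding the only reference and free_page() in
tdp_mmu_free_sp() pairs with pages from any of (1)-(3).

/*
 * Sketch only, not the actual patch: the cache hands out whole
 * page-allocator pages and keeps no reference afterwards, so
 * free_page((unsigned long)sp->external_spt) is a valid release for a
 * page from any of the three sources.
 */
static struct page *get_tdx_prealloc_page_sketch(struct tdx_prealloc *prealloc)
{
	struct page *page;

	page = list_first_entry_or_null(&prealloc->page_list, struct page, lru);
	if (page) {
		/* Drop the cache's reference; the sp now owns the page. */
		list_del(&page->lru);
		prealloc->cnt--;
	}

	return page;
}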
> Oh, and the hugepage series also fumbles its topup (why there's yet another
> topup API, I have no idea).
Another topup API was introduced because in the hugepage case the split
occurs in a non-vCPU context, so it cannot reuse the per-vCPU
tdx_topup_external_fault_cache(). That is why "kvm_tdx->prealloc_split_cache"
is per-VM.
(Please also see the patch log of
"[PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting" [1]).
[1] https://lore.kernel.org/all/20260106102331.25244-1-yan.y.zhao@intel.com.
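For reference, here is the rough shape of that per-VM cache as I picture it
(a sketch for discussion; field names are taken from the quoted snippets, not
from the applied patch):

struct tdx_prealloc {
	struct list_head page_list;	/* pre-allocated pages */
	int cnt;			/* number of pages on the list */
};

struct kvm_tdx {
	...
	/* Per-VM because splitting runs in non-vCPU context. */
	struct tdx_prealloc prealloc_split_cache;
	/* Protects enqueue (topup) vs. dequeue (SEAMCALL path); see below. */
	spinlock_t prealloc_split_cache_lock;
};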
> static int tdx_topup_vm_split_cache(struct kvm *kvm, enum pg_level level)
> {
> 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> 	struct tdx_prealloc *prealloc = &kvm_tdx->prealloc_split_cache;
> 	int cnt = tdx_min_split_cache_sz(kvm, level);
>
> 	while (READ_ONCE(prealloc->cnt) < cnt) {
> 		struct page *page = alloc_page(GFP_KERNEL);    <==== GFP_KERNEL_ACCOUNT
Thanks for catching it. Will use GFP_KERNEL_ACCOUNT instead.
> 		if (!page)
> 			return -ENOMEM;
>
> 		spin_lock(&kvm_tdx->prealloc_split_cache_lock);
I didn't have tdx_topup_vm_split_cache() reuse topup_tdx_prealloc_page() (as
used by tdx_topup_external_fault_cache()). This is due to the need to add the
spinlock kvm_tdx->prealloc_split_cache_lock to protect page enqueuing and
dequeuing. (I've pasted the explanation of the per-VM external cache for
splitting and its lock protections from [2] below as an "Appendix" for your
convenience).
If the split for huge pages did not reuse tdp_mmu_split_huge_pages_root()
(i.e., if we created a dedicated interface for
kvm_split_cross_boundary_leafs() instead, as I asked about in [3]), we could
preallocate enough pages up front and hold mmu_lock without releasing it for
memory allocation. Then this spinlock kvm_tdx->prealloc_split_cache_lock
would not be needed (a rough sketch of that flow follows after the quoted
code below).
[2] https://lore.kernel.org/all/20260106101646.24809-1-yan.y.zhao@intel.com
[3] https://lore.kernel.org/all/aW2Iwpuwoyod8eQc@yzhao56-desk.sh.intel.com
> 		list_add(&page->lru, &prealloc->page_list);
> 		prealloc->cnt++;
> 		spin_unlock(&kvm_tdx->prealloc_split_cache_lock);
> 	}
>
> 	return 0;
> }
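To illustrate the alternative flow I mean above (the entry point and the op
name are hypothetical, assuming kvm_split_cross_boundary_leafs() no longer
reuses tdp_mmu_split_huge_pages_root()):

/*
 * Hypothetical sketch: top up the per-VM split cache fully before taking
 * mmu_lock and never refill it while the lock is held.  The cache is then
 * only consumed under the write mmu_lock, so prealloc_split_cache_lock
 * becomes unnecessary.
 */
static int split_cross_boundary_leafs_sketch(struct kvm *kvm,
					     struct kvm_gfn_range *range)
{
	int r;

	/* 1) Preallocate the worst-case number of pages, outside mmu_lock. */
	r = kvm_x86_call(topup_vm_split_cache)(kvm, PG_LEVEL_2M);
	if (r)
		return r;

	/* 2) Split without dropping mmu_lock for memory allocation. */
	write_lock(&kvm->mmu_lock);
	r = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range);
	write_unlock(&kvm->mmu_lock);

	return r;
}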
Appendix:
Explanation of the per-VM external cache for splitting huge pages from the
cover letter of huge pages v3 [2]:
"
9. DPAMT
Currently, DPAMT's involvement with TDX huge page is limited to page
splitting.
As shown in the following call stack, DPAMT pages used by splitting are
pre-allocated and queued in the per-VM external split cache. They are
dequeued and consumed in tdx_sept_split_private_spte().
kvm_split_cross_boundary_leafs
  kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs
    tdp_mmu_split_huge_pages_root
(*)   1) tdp_mmu_alloc_sp_for_split()
+-----2.1) need_topup_external_split_cache(): check if enough pages in
|          the external split cache. Go to 3 if pages are enough.
|  +--2.2) topup_external_split_cache(): preallocate/enqueue pages in
|  |       the external split cache.
|  |  3) tdp_mmu_split_huge_page
|  |       tdp_mmu_link_sp
|  |         tdp_mmu_iter_set_spte
|  |(**)       tdp_mmu_set_spte
|  |             split_external_spte
|  |               kvm_x86_call(split_external_spte)
|  |                 tdx_sept_split_private_spte
|  |                   3.1) BLOCK, TRACK
+--+-------------------3.2) Dequeue PAMT pages from the external split
|  |                        cache for the new sept page
|  |                   3.3) PAMT_ADD for the new sept page
+--+-------------------3.4) Dequeue PAMT pages from the external split
                            cache for the 2MB guest private memory.
                       3.5) DEMOTE.
                       3.6) Update PAMT refcount of the 2MB guest private
                            memory.
(*)  The write mmu_lock is held across the checking of enough pages in
     cache in step 2.1 and the page dequeuing in steps 3.2 and 3.4, so
     it's ensured that dequeuing has enough pages in cache.
(**) A spinlock prealloc_split_cache_lock is used inside the TDX's cache
     implementation to protect page enqueuing in step 2.2 and page
     dequeuing in steps 3.2 and 3.4.
"