Message-ID: <aXENNKjAKTM9UJNH@google.com>
Date: Wed, 21 Jan 2026 09:30:28 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Kai Huang <kai.huang@...el.com>
Cc: "pbonzini@...hat.com" <pbonzini@...hat.com>, Yan Y Zhao <yan.y.zhao@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, Fan Du <fan.du@...el.com>,
Xiaoyao Li <xiaoyao.li@...el.com>, Chao Gao <chao.gao@...el.com>,
Dave Hansen <dave.hansen@...el.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
"vbabka@...e.cz" <vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>, "david@...nel.org" <david@...nel.org>,
"kas@...nel.org" <kas@...nel.org>, "michael.roth@....com" <michael.roth@....com>, Ira Weiny <ira.weiny@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "nik.borisov@...e.com" <nik.borisov@...e.com>,
Isaku Yamahata <isaku.yamahata@...el.com>, Chao P Peng <chao.p.peng@...el.com>,
"francescolavra.fl@...il.com" <francescolavra.fl@...il.com>, "sagis@...gle.com" <sagis@...gle.com>,
Vishal Annapurve <vannapurve@...gle.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>,
Jun Miao <jun.miao@...el.com>, "jgross@...e.com" <jgross@...e.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting

On Wed, Jan 21, 2026, Kai Huang wrote:
> On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> I have been thinking whether we can simplify the solution, not only just
> for avoiding this complicated memory cache topup-then-consume mechanism
> under MMU read lock, but also for avoiding kinda duplicated code about how
> to calculate how many DPAMT pages needed to topup etc between your next
> patch and similar code in DPAMT series for the per-vCPU cache.
>
> IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> and the mapped 2M range when splitting.
>
> - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
> tdx_alloc_page() directly which also handles DPAMT pages internally.
>
> Here in tdp_mmu_alloc_sp_for_split():
>
> sp->external_spt = tdx_alloc_page();
>
> For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
> that's per-vCPU cache which doesn't have the pain of per-VM cache. As I
> mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
> add two new obj_alloc()/free() callbacks to 'kvm_mmu_memory_cache':
>
> https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
>
> So we can get rid of the per-VM DPAMT cache for S-EPT pages.
>
> - For DPAMT pages for the TDX guest private memory, I think we can also
> get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
> needed DPAMT pages:
>
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -111,6 +111,7 @@ struct kvm_mmu_page {
> * Passed to TDX module, not accessed by KVM.
> */
> void *external_spt;
> + void *leaf_level_private;
> };
There's no need to put this in with external_spt, we could throw it in a new union
with unsync_child_bitmap (TDP MMU can't have unsync children). IIRC, the main
reason I've never suggested unionizing unsync_child_bitmap is that overloading
the bitmap would risk corruption if KVM ever marked a TDP MMU page as unsync, but
that's easy enough to guard against:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d568512201d..d6c6768c1f50 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 
 static void mark_unsync(u64 *spte)
 {
-	struct kvm_mmu_page *sp;
+	struct kvm_mmu_page *sp = sptep_to_sp(spte);
 
-	sp = sptep_to_sp(spte);
+	if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
+		return;
 	if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
 		return;
 	if (sp->unsync_children++)

I might send a patch to do that even if we don't overload the bitmap, as a
hardening measure.
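
To make the overload concrete, here is a minimal user-space sketch (not kernel
code) of sharing storage between the unsync bitmap and a private pointer, with
the same kind of guard as in the diff above. Every sketch_ name is hypothetical,
the tdp_mmu_page flag merely stands in for is_tdp_mmu_page(), and the 512-bit
bitmap assumes one bit per SPTE in a 4K page table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SPTES_PER_PAGE 512
#define SKETCH_BITS_PER_LONG  (8 * sizeof(unsigned long))
#define SKETCH_BITMAP_LONGS \
	((SKETCH_SPTES_PER_PAGE + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Hypothetical stand-in for the relevant parts of 'struct kvm_mmu_page'. */
struct sketch_sp {
	bool tdp_mmu_page;		/* stand-in for is_tdp_mmu_page(sp) */
	unsigned int unsync_children;
	union {
		/* Shadow MMU only: one bit per child SPTE. */
		unsigned long unsync_child_bitmap[SKETCH_BITMAP_LONGS];
		/* TDP MMU only: opaque per-leaf data (assumed name). */
		void *leaf_level_private;
	};
};

/*
 * Returns true if the bit was newly set.  Refuses to touch the bitmap for a
 * TDP MMU page, as writing it would corrupt leaf_level_private.
 */
static bool sketch_mark_unsync(struct sketch_sp *sp, unsigned int index)
{
	unsigned long mask = 1UL << (index % SKETCH_BITS_PER_LONG);
	unsigned long *word;

	if (sp->tdp_mmu_page)
		return false;

	word = &sp->unsync_child_bitmap[index / SKETCH_BITS_PER_LONG];
	if (*word & mask)
		return false;

	*word |= mask;
	sp->unsync_children++;
	return true;
}
```

The point of the guard is that a stray call on a TDP MMU page becomes a no-op
(a WARN in the real patch) instead of silently scribbling over the pointer.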
> Then we can define a structure which contains DPAMT pages for a given 2M
> range:
>
> struct tdx_dpamt_metadata {
> struct page *page1;
> struct page *page2;
> };
>
> Then when we allocate sp->external_spt, we can also allocate it for
> leaf_level_private via a kvm_x86_ops call when the 'sp' is actually the
> last level page table.
>
> In this case, I think we can get rid of the per-VM DPAMT cache?
>
> For the fault path, similarly, I believe we can use a per-vCPU cache for
> 'struct tdx_dpamt_metadata' if we utilize the two new obj_alloc()/free()
> hooks.
>
> The cost is that the new 'leaf_level_private' takes an additional 8 bytes
> for non-TDX guests even though it is never used there, but if what I said
> above is feasible, maybe it's worth the cost.
>
> But it's completely possible that I missed something. Any thoughts?

I *LOVE* the core idea (seriously, this made my week), though I think we should
take it a step further and _immediately_ do DPAMT maintenance on allocation.
I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
SP cache instead of waiting until KVM links the SP. Then KVM doesn't need to
track PAMT pages except for memory that is mapped into a guest, and we end up
with better symmetry and more consistency throughout TDX. E.g. all pages that
KVM allocates and gifts to the TDX-Module will be allocated and freed via the
same TDX APIs.
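
As a rough illustration of "do the maintenance at top-up time", below is a
user-space sketch of an object cache with pluggable alloc/free hooks. It is
not the kvm_mmu_memory_cache implementation: all sketch_ names are
hypothetical, the hooks only stand in for tdx_alloc_control_page() and its
free counterpart, and a plain counter stands in for whatever DPAMT accounting
the real allocator would do. Top-up is also simplified to fill only up to the
requested minimum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_CACHE_CAPACITY 40

/* Hypothetical cache with per-cache allocation hooks. */
struct sketch_cache {
	void *(*obj_alloc)(void);
	void (*obj_free)(void *obj);
	int nobjs;
	void *objects[SKETCH_CACHE_CAPACITY];
};

/* Stand-in for DPAMT accounting: pages currently "got" via the hook. */
static int sketch_live_control_pages;

static void *sketch_alloc_control_page(void)
{
	void *page = calloc(1, 4096);

	if (page)
		sketch_live_control_pages++;	/* accounting at allocation */
	return page;
}

static void sketch_free_control_page(void *page)
{
	if (!page)
		return;
	sketch_live_control_pages--;		/* undone at free */
	free(page);
}

/* Fill the cache to at least 'min' objects; 0 on success, -1 on failure. */
static int sketch_topup(struct sketch_cache *mc, int min)
{
	if (min > SKETCH_CACHE_CAPACITY)
		return -1;

	while (mc->nobjs < min) {
		void *obj = mc->obj_alloc();

		if (!obj)
			return -1;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}

/* Consume one pre-accounted object, or NULL if the cache is empty. */
static void *sketch_cache_pop(struct sketch_cache *mc)
{
	return mc->nobjs ? mc->objects[--mc->nobjs] : NULL;
}

/* Release unused objects through the same hook pair that allocated them. */
static void sketch_cache_drain(struct sketch_cache *mc)
{
	while (mc->nobjs)
		mc->obj_free(mc->objects[--mc->nobjs]);
}
```

With this shape, the consumer never tracks accounting for cached objects: an
object is fully set up when it enters the cache and fully torn down when it
leaves through the free hook, whether or not it was ever used.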

Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
entries per-vCPU that end up being freed without ever being gifted to the
TDX-Module. But I doubt that will be a problem in practice, because odds are
good the adjacent pages/pfns will already have been consumed, i.e. the
"speculative" allocation is really just bumping the refcount. And _if_ it's a
problem, e.g. results in too many wasted DPAMT entries, then it's one we can
solve in KVM by tuning the cache capacity to less aggressively allocate DPAMT
entries.
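
A toy model of that argument, assuming (as the paragraph above implies) that
DPAMT backing is refcounted per 2M range so that only the first "get" in a
range installs anything. All sketch_ names and the fixed-size table are
hypothetical and exist only to show the two extremes: 40 cached pages that
share one range versus 40 pages that each land in a different range.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_2M_SHIFT   21
#define SKETCH_NR_RANGES  64

/* Hypothetical per-2M-range refcounts plus a count of backed ranges. */
struct sketch_pamt {
	unsigned int refcount[SKETCH_NR_RANGES];
	unsigned int live_backings;
};

/* Map a 4K pfn to its 2M range (wrapped to keep the toy table bounded). */
static unsigned int sketch_range_of(uint64_t pfn)
{
	return (unsigned int)((pfn >> (SKETCH_2M_SHIFT - SKETCH_PAGE_SHIFT)) %
			      SKETCH_NR_RANGES);
}

/* First get in a range installs backing; later gets only bump the count. */
static void sketch_pamt_get(struct sketch_pamt *p, uint64_t pfn)
{
	if (p->refcount[sketch_range_of(pfn)]++ == 0)
		p->live_backings++;
}

/* Last put in a range removes the backing again. */
static void sketch_pamt_put(struct sketch_pamt *p, uint64_t pfn)
{
	unsigned int r = sketch_range_of(pfn);

	if (p->refcount[r] && --p->refcount[r] == 0)
		p->live_backings--;
}
```

In the common case sketched here, 40 speculative gets on neighbouring pfns
cost one backing; only when every cached page falls in a distinct 2M range
does the count reach the cache capacity.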

I'll send a compile-tested v4 for the DPAMT series later today (I think I can
get it out today), as I have other non-trivial feedback that I've accumulated
while going through the patches.