linux-kernel - Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting

Message-ID: <aXENNKjAKTM9UJNH@google.com>
Date: Wed, 21 Jan 2026 09:30:28 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Kai Huang <kai.huang@...el.com>
Cc: "pbonzini@...hat.com" <pbonzini@...hat.com>, Yan Y Zhao <yan.y.zhao@...el.com>, 
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, Fan Du <fan.du@...el.com>, 
	Xiaoyao Li <xiaoyao.li@...el.com>, Chao Gao <chao.gao@...el.com>, 
	Dave Hansen <dave.hansen@...el.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>, 
	"vbabka@...e.cz" <vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>, "david@...nel.org" <david@...nel.org>, 
	"kas@...nel.org" <kas@...nel.org>, "michael.roth@....com" <michael.roth@....com>, Ira Weiny <ira.weiny@...el.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, 
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "nik.borisov@...e.com" <nik.borisov@...e.com>, 
	Isaku Yamahata <isaku.yamahata@...el.com>, Chao P Peng <chao.p.peng@...el.com>, 
	"francescolavra.fl@...il.com" <francescolavra.fl@...il.com>, "sagis@...gle.com" <sagis@...gle.com>, 
	Vishal Annapurve <vannapurve@...gle.com>, Rick P Edgecombe <rick.p.edgecombe@...el.com>, 
	Jun Miao <jun.miao@...el.com>, "jgross@...e.com" <jgross@...e.com>, 
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting

On Wed, Jan 21, 2026, Kai Huang wrote:
> On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> I have been thinking whether we can simplify the solution, not only just
> for avoiding this complicated memory cache topup-then-consume mechanism
> under MMU read lock, but also for avoiding kinda duplicated code about how
> to calculate how many DPAMT pages needed to topup etc between your next
> patch and similar code in DPAMT series for the per-vCPU cache.
> 
> IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> and the mapped 2M range when splitting.
> 
> - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
> tdx_alloc_page() directly which also handles DPAMT pages internally.
> 
> Here in tdp_mmmu_alloc_sp_for_split():
> 
> 	sp->external_spt = tdx_alloc_page();
> 
> For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
> that's per-vCPU cache which doesn't have the pain of per-VM cache.  As I
> mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
> add two new obj_alloc()/free() callback to 'kvm_mmu_memory_cache':
> 
> https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
> 
> So we can get rid of the per-VM DPAMT cache for S-EPT pages.
> 
> - For DPAMT pages for the TDX guest private memory, I think we can also
> get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
> needed DPAMT pages:
> 
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -111,6 +111,7 @@ struct kvm_mmu_page {
>                  * Passed to TDX module, not accessed by KVM.
>                  */
>                 void *external_spt;
> +               void *leaf_level_private;
>         };

There's no need to put this in with external_spt; we could throw it in a new union
with unsync_child_bitmap (TDP MMU can't have unsync children).  IIRC, the main
reason I've never suggested unionizing unsync_child_bitmap is that overloading
the bitmap would risk corruption if KVM ever marked a TDP MMU page as unsync, but
that's easy enough to guard against:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d568512201d..d6c6768c1f50 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
 
 static void mark_unsync(u64 *spte)
 {
-       struct kvm_mmu_page *sp;
+       struct kvm_mmu_page *sp = sptep_to_sp(spte);
 
-       sp = sptep_to_sp(spte);
+       if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
+               return;
        if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
                return;
        if (sp->unsync_children++)


I might send a patch to do that even if we don't overload the bitmap, as a
hardening measure.
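
As a rough illustration of the overloading idea, here is a minimal userspace
sketch, not kernel code.  struct sp_model, its tdp_mmu_page flag, and
mark_unsync_model() are hypothetical stand-ins for kvm_mmu_page,
is_tdp_mmu_page() and mark_unsync(); leaf_level_private is the field name
proposed in the quoted text.  The model simply refuses the update where the
kernel version would WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kvm_mmu_page; only the relevant fields. */
struct sp_model {
	bool tdp_mmu_page;		/* models is_tdp_mmu_page(sp) */
	unsigned int unsync_children;
	union {
		/* 512 bits, one per SPTE; meaningful for shadow MMU pages only. */
		uint64_t unsync_child_bitmap[8];
		/* Meaningful for TDP MMU pages only (assumed overload). */
		void *leaf_level_private;
	};
};

/*
 * Returns true if the child bit was newly set.  Returns false if the bit was
 * already set, or if the page is a TDP MMU page, in which case the union
 * holds a pointer and must not be touched.
 */
bool mark_unsync_model(struct sp_model *sp, unsigned int index)
{
	uint64_t mask = 1ull << (index % 64);

	assert(index < 512);

	if (sp->tdp_mmu_page)
		return false;	/* the kernel version would WARN_ON_ONCE() here */

	if (sp->unsync_child_bitmap[index / 64] & mask)
		return false;

	sp->unsync_child_bitmap[index / 64] |= mask;
	sp->unsync_children++;
	return true;
}
```

With the guard in place, a stray call on a TDP MMU page leaves the overloaded
pointer intact instead of flipping a bit inside it.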

> Then we can define a structure which contains DPAMT pages for a given 2M
> range:
> 
> 	struct tdx_dmapt_metadata {
> 		struct page *page1;
> 		struct page *page2;
> 	};
> 
> Then when we allocate sp->external_spt, we can also allocate it for
> leaf_level_private via kvm_x86_ops call when we the 'sp' is actually the
> last level page table.
> 
> In this case, I think we can get rid of the per-VM DPAMT cache?
> 
> For the fault path, similarly, I believe we can use a per-vCPU cache for
> 'struct tdx_dpamt_memtadata' if we utilize the two new obj_alloc()/free()
> hooks.
> 
> The cost is the new 'leaf_level_private' takes additional 8-bytes for non-
> TDX guests even they are never used, but if what I said above is feasible,
> maybe it's worth the cost.
> 
> But it's completely possible that I missed something.  Any thoughts?

I *LOVE* the core idea (seriously, this made my week), though I think we should
take it a step further and _immediately_ do DPAMT maintenance on allocation.
I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
track PAMT pages except for memory that is mapped into a guest, and we end up
with better symmetry and more consistency throughout TDX.  E.g. all pages that
KVM allocates and gifts to the TDX-Module will be allocated and freed via the same
TDX APIs.
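
A minimal userspace model of "do the TDX-specific work at topup time" might
look like the following.  It is a sketch under assumptions, not the actual
kvm_mmu_memory_cache code: the obj_alloc/obj_free hook names follow the
callbacks proposed in the quoted text, struct cache_model and the
counting_*() helpers are hypothetical, and the counters merely stand in for
whatever the TDX allocation API would do (DPAMT maintenance is not modeled).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_CAPACITY 40	/* the SP cache capacity mentioned in this thread */

struct cache_model {
	void *(*obj_alloc)(void);	/* hypothetical hook; NULL means calloc() */
	void (*obj_free)(void *obj);	/* hypothetical hook; NULL means free() */
	int nobjs;
	void *objects[CACHE_CAPACITY];
};

unsigned int hook_allocs, hook_frees;

/* Stand-in for a TDX-aware allocator; counts only successful allocations. */
void *counting_alloc(void)
{
	void *obj = calloc(1, 4096);

	if (obj)
		hook_allocs++;
	return obj;
}

void counting_free(void *obj)
{
	hook_frees++;
	free(obj);
}

/* If fewer than @min objects are cached, fill the cache to capacity. */
int cache_topup(struct cache_model *mc, int min)
{
	assert(mc->nobjs >= 0 && mc->nobjs <= CACHE_CAPACITY);

	if (min > CACHE_CAPACITY)
		return -1;
	if (mc->nobjs >= min)
		return 0;

	while (mc->nobjs < CACHE_CAPACITY) {
		void *obj = mc->obj_alloc ? mc->obj_alloc() : calloc(1, 4096);

		if (!obj)
			return mc->nobjs >= min ? 0 : -1;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}

/* Consume one pre-allocated object; no allocation happens here. */
void *cache_alloc(struct cache_model *mc)
{
	return mc->nobjs ? mc->objects[--mc->nobjs] : NULL;
}

/* Release unused objects through the same hook that allocated them. */
void cache_free_all(struct cache_model *mc)
{
	while (mc->nobjs) {
		void *obj = mc->objects[--mc->nobjs];

		if (mc->obj_free)
			mc->obj_free(obj);
		else
			free(obj);
	}
}
```

The point of the model is the symmetry: every object enters and leaves the
cache through one alloc/free pair, so the consumer never needs to know
whether extra bookkeeping was attached at allocation time.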

Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
entries per-vCPU that end up being freed without ever being gifted to the TDX-Module.
But I doubt that will be a problem in practice, because odds are good the adjacent
pages/pfns will already have been consumed, i.e. the "speculative" allocation is
really just bumping the refcount.  And _if_ it's a problem, e.g. results in too
many wasted DPAMT entries, then it's one we can solve in KVM by tuning the cache
capacity to less aggressively allocate DPAMT entries.
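
The "really just bumping the refcount" argument can be sketched with a small
userspace model.  Everything here is an assumption made for illustration: one
refcount per 2M-aligned region, two backing pages per region (mirroring the
page1/page2 pair in the quoted struct), and the pamt_*_model() names, which
are hypothetical and not the tdx_pamt_get() implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SHIFT_2M	21
#define NR_REGIONS	16	/* the model covers 32 MiB of physical addresses */

struct pamt_region_model {
	unsigned int refcount;	/* 4K pages in this 2M region holding a ref */
	void *backing[2];	/* assumed: two backing pages per 2M region */
};

static struct pamt_region_model regions[NR_REGIONS];
unsigned int backing_allocs;	/* how often backing pages were really allocated */

/* Take a reference for the 4K page at @pa; returns 0 on success, -1 on error. */
int pamt_get_model(uint64_t pa)
{
	uint64_t idx = pa >> SHIFT_2M;
	struct pamt_region_model *r;

	if (idx >= NR_REGIONS)
		return -1;

	r = &regions[idx];
	if (r->refcount) {
		/* A neighbor already populated the region: refcount bump only. */
		r->refcount++;
		return 0;
	}

	r->backing[0] = malloc(4096);
	r->backing[1] = malloc(4096);
	if (!r->backing[0] || !r->backing[1]) {
		free(r->backing[0]);
		free(r->backing[1]);
		r->backing[0] = r->backing[1] = NULL;
		return -1;
	}

	backing_allocs++;
	r->refcount = 1;
	return 0;
}

/* Drop a reference; the backing pages are freed when the last one goes away. */
void pamt_put_model(uint64_t pa)
{
	uint64_t idx = pa >> SHIFT_2M;
	struct pamt_region_model *r;

	assert(idx < NR_REGIONS);
	r = &regions[idx];
	assert(r->refcount > 0);

	if (--r->refcount)
		return;

	free(r->backing[0]);
	free(r->backing[1]);
	r->backing[0] = r->backing[1] = NULL;
}
```

In this model a speculative get on a page whose neighbors are already in use
costs one increment, and only the first get in an otherwise empty 2M region
pays for backing pages.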

I'll send a compile-tested v4 for the DPAMT series later today (I think I can get
it out today), as I have other non-trivial feedback that I've accumulated when
going through the patches.