linux-kernel - Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting

Message-ID: <aXHLsorSWHRslpZh@yzhao56-desk.sh.intel.com>
Date: Thu, 22 Jan 2026 15:03:14 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Kai Huang <kai.huang@...el.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, Fan Du
	<fan.du@...el.com>, Xiaoyao Li <xiaoyao.li@...el.com>, Chao Gao
	<chao.gao@...el.com>, Dave Hansen <dave.hansen@...el.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "vbabka@...e.cz"
	<vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>, "david@...nel.org"
	<david@...nel.org>, "kas@...nel.org" <kas@...nel.org>, "michael.roth@....com"
	<michael.roth@....com>, Ira Weiny <ira.weiny@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "nik.borisov@...e.com"
	<nik.borisov@...e.com>, Isaku Yamahata <isaku.yamahata@...el.com>, "Chao P
 Peng" <chao.p.peng@...el.com>, "francescolavra.fl@...il.com"
	<francescolavra.fl@...il.com>, "sagis@...gle.com" <sagis@...gle.com>, "Vishal
 Annapurve" <vannapurve@...gle.com>, Rick P Edgecombe
	<rick.p.edgecombe@...el.com>, Jun Miao <jun.miao@...el.com>,
	"jgross@...e.com" <jgross@...e.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for
 splitting

On Wed, Jan 21, 2026 at 09:30:28AM -0800, Sean Christopherson wrote:
> On Wed, Jan 21, 2026, Kai Huang wrote:
> > On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> > I have been thinking about whether we can simplify the solution, not only
> > to avoid this complicated memory cache topup-then-consume mechanism under
> > the MMU read lock, but also to avoid duplicating the code that calculates
> > how many DPAMT pages to top up, which would otherwise exist both in your
> > next patch and in the DPAMT series for the per-vCPU cache.
> > 
> > IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> > and the mapped 2M range when splitting.
> > 
> > - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
> > tdx_alloc_page() directly which also handles DPAMT pages internally.
> > 
> > Here in tdp_mmu_alloc_sp_for_split():
> > 
> > 	sp->external_spt = tdx_alloc_page();
> > 
> > For the fault path we need to use the normal 'kvm_mmu_memory_cache', but
> > that's a per-vCPU cache, which doesn't have the pain of the per-VM cache.
> > As I mentioned in v3, I believe we can also hook in tdx_alloc_page() if we
> > add two new obj_alloc()/free() callbacks to 'kvm_mmu_memory_cache':
> > 
> > https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
> > 
> > So we can get rid of the per-VM DPAMT cache for S-EPT pages.
> > 
> > - For DPAMT pages for the TDX guest private memory, I think we can also
> > get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
> > needed DPAMT pages:
> > 
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -111,6 +111,7 @@ struct kvm_mmu_page {
> >                  * Passed to TDX module, not accessed by KVM.
> >                  */
> >                 void *external_spt;
> > +               void *leaf_level_private;
> >         };
> 
> There's no need to put this in with external_spt, we could throw it in a new union
> with unsync_child_bitmap (TDP MMU can't have unsync children).  IIRC, the main
> reason I've never suggested unionizing unsync_child_bitmap is that overloading
> the bitmap would risk corruption if KVM ever marked a TDP MMU page as unsync, but
> that's easy enough to guard against:
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3d568512201d..d6c6768c1f50 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
>  
>  static void mark_unsync(u64 *spte)
>  {
> -       struct kvm_mmu_page *sp;
> +       struct kvm_mmu_page *sp = sptep_to_sp(spte);
>  
> -       sp = sptep_to_sp(spte);
> +       if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
> +               return;
>         if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
>                 return;
>         if (sp->unsync_children++)
> 
> 
> I might send a patch to do that even if we don't overload the bitmap, as a
> hardening measure.
> 
> > Then we can define a structure which contains DPAMT pages for a given 2M
> > range:
> > 
> > 	struct tdx_dpamt_metadata {
> > 		struct page *page1;
> > 		struct page *page2;
> > 	};

Note: we need 4 pages to split a 2MB range, 2 for the new S-EPT page, 2 for the
2MB guest memory range.


> > Then when we allocate sp->external_spt, we can also allocate it for
> > leaf_level_private via a kvm_x86_ops call when the 'sp' is actually the
> > last-level page table.
> > 
> > In this case, I think we can get rid of the per-VM DPAMT cache?
> > 
> > For the fault path, similarly, I believe we can use a per-vCPU cache for
> > 'struct tdx_dpamt_metadata' if we utilize the two new obj_alloc()/free()
> > hooks.
> > 
> > The cost is that the new 'leaf_level_private' takes an additional 8 bytes
> > for non-TDX guests even though it is never used there, but if what I said
> > above is feasible, maybe it's worth the cost.
> > 
> > But it's completely possible that I missed something.  Any thoughts?
> 
> I *LOVE* the core idea (seriously, this made my week), though I think we should
Me too!

> take it a step further and _immediately_ do DPAMT maintenance on allocation.
> I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
> SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
> track PAMT pages except for memory that is mapped into a guest, and we end up
> with better symmetry and more consistency throughout TDX.  E.g. all pages that
> KVM allocates and gifts to the TDX-Module will be allocated and freed via the same
> TDX APIs.
Not sure if I understand this paragraph correctly.

I'm wondering whether it can help us get rid of the asymmetry. For example:
when KVM wants to split a 2MB page, it allocates an sp at the 4K level, which
carries 2 PAMT pages for the new S-EPT page.
During the split, the 2 PAMT pages are installed successfully. However, the
split then fails due to a DEMOTE failure. It looks like KVM then needs to
uninstall and free the 2 PAMT pages for the new S-EPT page, right?

However, some other thread may meanwhile have consumed those 2 PAMT pages for
an adjacent 4KB page within the same 2MB range as the new S-EPT page.
So KVM still can't free the 2 PAMT pages that it allocated for the split.
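
To make the concern concrete, here is a toy userspace model of it. It is only
a sketch under my own assumptions: 'struct pamt_range', pamt_get() and
pamt_put() are hypothetical names for illustration, not the kernel's
tdx_pamt_get()/put() implementation, and the per-2MB-range refcount is my
reading of the "just bumping the refcount" remark, not something I have
verified against the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Assumption taken from this thread: 2 PAMT pages back one 2MB range. */
#define PAMT_PAGES_PER_2M 2

/* Hypothetical stand-in for the DPAMT state of one 2MB physical range. */
struct pamt_range {
	int refcount;                   /* users of 4KB pages in this range */
	void *pamt[PAMT_PAGES_PER_2M];  /* backing pages, NULL if not installed */
};

/* The first user installs the backing pages; later users only take a ref. */
static int pamt_get(struct pamt_range *r)
{
	if (r->refcount == 0) {
		for (int i = 0; i < PAMT_PAGES_PER_2M; i++) {
			r->pamt[i] = malloc(4096);
			if (!r->pamt[i]) {
				/* Unwind the pages installed so far. */
				while (i--) {
					free(r->pamt[i]);
					r->pamt[i] = NULL;
				}
				return -1;
			}
		}
	}
	r->refcount++;
	return 0;
}

/* Returns true only if this was the last user and the pages were freed. */
static bool pamt_put(struct pamt_range *r)
{
	assert(r->refcount > 0);
	if (--r->refcount)
		return false;
	for (int i = 0; i < PAMT_PAGES_PER_2M; i++) {
		free(r->pamt[i]);
		r->pamt[i] = NULL;
	}
	return true;
}
```

In this model, if the split path does pamt_get(), another thread does
pamt_get() for an adjacent 4KB page, and the split path then calls pamt_put()
after the DEMOTE failure, the put returns false and the 2 PAMT pages stay
installed until the other user drops its reference.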

Will check your patches for better understanding.

> Absolute worst case scenario, KVM allocates 40 (KVM's SP cache capacity) PAMT
> entries per-vCPU that end up being freed without ever being gifted to the TDX-Module.
> But I doubt that will be a problem in practice, because odds are good the adjacent
> pages/pfns will already have been consumed, i.e. the "speculative" allocation is
> really just bumping the refcount.  And _if_ it's a problem, e.g. results in too
> many wasted DPAMT entries, then it's one we can solve in KVM by tuning the cache
> capacity to less aggressively allocate DPAMT entries.
> 
> I'll send a compile-tested v4 for the DPAMT series later today (I think I can get
> it out today), as I have other non-trivial feedback that I've accumulated while
> going through the patches.