linux-kernel - Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for splitting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <313244596b4459326bf65f50849cafe434c76813.camel@intel.com>
Date: Wed, 21 Jan 2026 23:01:06 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "seanjc@...gle.com" <seanjc@...gle.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen, Dave"
	<dave.hansen@...el.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "vbabka@...e.cz"
	<vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>, "david@...nel.org"
	<david@...nel.org>, "michael.roth@....com" <michael.roth@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "Peng, Chao P"
	<chao.p.peng@...el.com>, "Weiny, Ira" <ira.weiny@...el.com>, "kas@...nel.org"
	<kas@...nel.org>, "nik.borisov@...e.com" <nik.borisov@...e.com>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>,
	"francescolavra.fl@...il.com" <francescolavra.fl@...il.com>, "Yamahata,
 Isaku" <isaku.yamahata@...el.com>, "sagis@...gle.com" <sagis@...gle.com>,
	"Gao, Chao" <chao.gao@...el.com>, "Edgecombe, Rick P"
	<rick.p.edgecombe@...el.com>, "Miao, Jun" <jun.miao@...el.com>, "Annapurve,
 Vishal" <vannapurve@...gle.com>, "jgross@...e.com" <jgross@...e.com>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for
 splitting

On Wed, 2026-01-21 at 09:30 -0800, Sean Christopherson wrote:
> On Wed, Jan 21, 2026, Kai Huang wrote:
> > On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> > I have been thinking whether we can simplify the solution, not only just
> > for avoiding this complicated memory cache topup-then-consume mechanism
> > under MMU read lock, but also for avoiding kinda duplicated code about how
> > to calculate how many DPAMT pages needed to topup etc between your next
> > patch and similar code in DPAMT series for the per-vCPU cache.
> > 
> > IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
> > and the mapped 2M range when splitting.
> > 
> > - For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
> > tdx_alloc_page() directly which also handles DPAMT pages internally.
> > 
> > Here in tdp_mmmu_alloc_sp_for_split():
> > 
> > 	sp->external_spt = tdx_alloc_page();
> > 
> > For the fault path we need to use the normal 'kvm_mmu_memory_cache' but
> > that's per-vCPU cache which doesn't have the pain of per-VM cache.  As I
> > mentioned in v3, I believe we can also hook to use tdx_alloc_page() if we
> > add two new obj_alloc()/free() callback to 'kvm_mmu_memory_cache':
> > 
> > https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/
> > 
> > So we can get rid of the per-VM DPAMT cache for S-EPT pages.
> > 
> > - For DPAMT pages for the TDX guest private memory, I think we can also
> > get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
> > needed DPAMT pages:
> > 
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -111,6 +111,7 @@ struct kvm_mmu_page {
> >                  * Passed to TDX module, not accessed by KVM.
> >                  */
> >                 void *external_spt;
> > +               void *leaf_level_private;
> >         };
> 
> There's no need to put this in with external_spt, we could throw it in a new union
> with unsync_child_bitmap (TDP MMU can't have unsync children).
> 

Agreed.

> IIRC, the main
> reason I've never suggested unionizing unsync_child_bitmap is that overloading
> the bitmap would risk corruption if KVM ever marked a TDP MMU page as unsync, but
> that's easy enough to guard against:
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3d568512201d..d6c6768c1f50 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1917,9 +1917,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
>  
>  static void mark_unsync(u64 *spte)
>  {
> -       struct kvm_mmu_page *sp;
> +       struct kvm_mmu_page *sp = sptep_to_sp(spte);
>  
> -       sp = sptep_to_sp(spte);
> +       if (WARN_ON_ONCE(is_tdp_mmu_page(sp)))
> +               return;
>         if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
>                 return;
>         if (sp->unsync_children++)
> 

LGTM.

> 
> I might send a patch to do that even if we don't overload the bitmap, as a
> hardening measure.
> 
> > Then we can define a structure which contains DPAMT pages for a given 2M
> > range:
> > 
> > 	struct tdx_dmapt_metadata {
> > 		struct page *page1;
> > 		struct page *page2;
> > 	};
> > 
> > Then when we allocate sp->external_spt, we can also allocate it for
> > leaf_level_private via kvm_x86_ops call when we the 'sp' is actually the
> > last level page table.
> > 
> > In this case, I think we can get rid of the per-VM DPAMT cache?
> > 
> > For the fault path, similarly, I believe we can use a per-vCPU cache for
> > 'struct tdx_dpamt_memtadata' if we utilize the two new obj_alloc()/free()
> > hooks.
> > 
> > The cost is the new 'leaf_level_private' takes additional 8-bytes for non-
> > TDX guests even they are never used, but if what I said above is feasible,
> > maybe it's worth the cost.
> > 
> > But it's completely possible that I missed something.  Any thoughts?
> 
> I *LOVE* the core idea (seriously, this made my week), though I think we should
> take it a step further and _immediately_ do DPAMT maintenance on allocation.
> I.e. do tdx_pamt_get() via tdx_alloc_control_page() when KVM tops up the S-EPT
> SP cache instead of waiting until KVM links the SP.  Then KVM doesn't need to
> track PAMT pages except for memory that is mapped into a guest, and we end up
> with better symmetry and more consistency throughout TDX.  E.g. all pages that
> KVM allocates and gifts to the TDX-Module will allocated and freed via the same
> TDX APIs.

Agreed.

Nit: do you still want to use the name tdx_alloc_control_page() instead of
tdx_alloc_page() if it is also used for S-EPT pages?  Kinda not sure but
both work for me.