Message-ID: <b9487eba19c134c1801a536945e8ae57ea93032f.camel@intel.com>
Date: Wed, 21 Jan 2026 01:54:14 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "pbonzini@...hat.com" <pbonzini@...hat.com>, "seanjc@...gle.com"
	<seanjc@...gle.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>
CC: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Du, Fan" <fan.du@...el.com>,
	"Li, Xiaoyao" <xiaoyao.li@...el.com>, "Gao, Chao" <chao.gao@...el.com>,
	"Hansen, Dave" <dave.hansen@...el.com>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>,
	"tabba@...gle.com" <tabba@...gle.com>, "david@...nel.org" <david@...nel.org>,
	"kas@...nel.org" <kas@...nel.org>, "michael.roth@....com"
	<michael.roth@....com>, "Weiny, Ira" <ira.weiny@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "nik.borisov@...e.com"
	<nik.borisov@...e.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>, "Peng,
 Chao P" <chao.p.peng@...el.com>, "francescolavra.fl@...il.com"
	<francescolavra.fl@...il.com>, "sagis@...gle.com" <sagis@...gle.com>,
	"Annapurve, Vishal" <vannapurve@...gle.com>, "Edgecombe, Rick P"
	<rick.p.edgecombe@...el.com>, "Miao, Jun" <jun.miao@...el.com>,
	"jgross@...e.com" <jgross@...e.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v3 19/24] KVM: x86: Introduce per-VM external cache for
 splitting

On Tue, 2026-01-06 at 18:23 +0800, Yan Zhao wrote:
> Introduce per-VM external cache for splitting the external page table by
> adding KVM x86 ops for cache "topup", "free", "need topup" operations.
> 
> Invoke the KVM x86 ops for "topup", "need topup" for the per-VM external
> split cache when splitting the mirror root in
> tdp_mmu_split_huge_pages_root() where there's no per-vCPU context.
> 
> Invoke the KVM x86 op for "free" to destroy the per-VM external split cache
> when KVM frees memory caches.
> 
> This per-VM external split cache is only used when per-vCPU context is not
> available. Use the per-vCPU external fault cache in the fault path
> when per-vCPU context is available.
> 
> The per-VM external split cache is protected under both kvm->mmu_lock and a
> cache lock inside vendor implementations to ensure that there are enough
> pages in the cache for one split:
> 
> - Dequeuing of the per-VM external split cache is in
>   kvm_x86_ops.split_external_spte() under mmu_lock.
> 
> - Yield the traversal in tdp_mmu_split_huge_pages_root() after topup of
>   the per-VM cache, so that need_topup() is checked again after
>   re-acquiring the mmu_lock.
> 
> - Vendor implementations of the per-VM external split cache provide a
>   cache lock to protect the enqueue/dequeue of pages into/from the cache.
> 
> Here's the sequence showing how having enough pages in the cache is
> guaranteed.
> 
> a. with write mmu_lock:
> 
>    1. write_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup()
> 
>    2. write_unlock(&kvm->mmu_lock)
>       kvm_x86_ops.topup() --> in vendor:
>       {
>         allocate pages
>         get cache lock
>         enqueue pages in cache
>         put cache lock
>       }
> 
>    3. write_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup() (goto 2 if topup is necessary)  (*)
> 
>       kvm_x86_ops.split_external_spte() --> in vendor:
>       {
>          get cache lock
>          dequeue pages in cache
>          put cache lock
>       }
>       write_unlock(&kvm->mmu_lock)
> 
> b. with read mmu_lock,
> 
>    1. read_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup()
> 
>    2. read_unlock(&kvm->mmu_lock)
>       kvm_x86_ops.topup() --> in vendor:
>       {
>         allocate pages
>         get cache lock
>         enqueue pages in cache
>         put cache lock
>       }
> 
>    3. read_lock(&kvm->mmu_lock)
>       kvm_x86_ops.need_topup() (goto 2 if topup is necessary)
> 
>       kvm_x86_ops.split_external_spte() --> in vendor:
>       {
>          get cache lock
>          kvm_x86_ops.need_topup() (return retry if topup is necessary) (**)
>          dequeue pages in cache
>          put cache lock
>       }
> 
>       read_unlock(&kvm->mmu_lock)
> 
> Due to (*) and (**) in step 3, having enough pages for a split is guaranteed.

It feels like an enormous pain to make sure there are enough objects in the
cache, _especially_ under MMU read lock -- you need an additional cache
lock and need to call need_topup() twice for that, and the caller needs
to handle -EAGAIN.
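
To spell out what that looks like for the caller under read mmu_lock, it
ends up being roughly the below -- just a sketch reusing the op names from
your changelog, with placeholder arguments:

	int r;

	read_lock(&kvm->mmu_lock);
	while (kvm_x86_ops.need_topup(kvm)) {
		read_unlock(&kvm->mmu_lock);

		/* May sleep; takes the vendor cache lock when enqueuing. */
		r = kvm_x86_ops.topup(kvm);
		if (r)
			return r;

		read_lock(&kvm->mmu_lock);
	}

	/*
	 * Re-checks the cache under the vendor cache lock and returns
	 * -EAGAIN if another splitter drained it, so the caller still
	 * has to be prepared to retry.
	 */
	r = kvm_x86_ops.split_external_spte(kvm, sptep, level);
	read_unlock(&kvm->mmu_lock);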

That being said, I _think_ this is also the reason that
tdp_mmu_alloc_sp_for_split() chose to just use normal memory allocation
for the sp and sp->spt rather than a per-VM cache based on KVM's
kvm_mmu_memory_cache.
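
For reference (paraphrasing the current code from memory, so not verbatim),
tdp_mmu_alloc_sp_for_split() today is just plain allocations with no cache
involved:

	sp = kmem_cache_zalloc(mmu_page_header_cache, GFP_KERNEL_ACCOUNT);
	if (!sp)
		return NULL;

	sp->spt = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
	if (!sp->spt) {
		kmem_cache_free(mmu_page_header_cache, sp);
		return NULL;
	}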

I have been thinking about whether we can simplify the solution, not only
to avoid this complicated memory cache topup-then-consume mechanism under
MMU read lock, but also to avoid the somewhat duplicated code for
calculating how many DPAMT pages need to be topped up etc. between your
next patch and the similar code in the DPAMT series for the per-vCPU cache.

IIRC, the per-VM DPAMT cache (in your next patch) covers both S-EPT pages
and the mapped 2M range when splitting.

- For S-EPT pages, they are _ALWAYS_ 4K, so we can actually use
tdx_alloc_page() directly, which also handles DPAMT pages internally.

Here in tdp_mmu_alloc_sp_for_split():

	sp->external_spt = tdx_alloc_page();

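I.e. with error handling, and assuming tdx_alloc_page() returns a
struct page * (NULL on failure) and installs the backing DPAMT pages
internally -- adjust if the final API differs:

	struct page *page = tdx_alloc_page();

	if (!page) {
		free_page((unsigned long)sp->spt);
		kmem_cache_free(mmu_page_header_cache, sp);
		return NULL;
	}
	sp->external_spt = page_address(page);
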
For the fault path we need to use the normal 'kvm_mmu_memory_cache', but
that's a per-vCPU cache which doesn't have the pain of the per-VM cache.
As I mentioned in v3, I believe we can also hook in tdx_alloc_page() if we
add two new obj_alloc()/free() callbacks to 'kvm_mmu_memory_cache':

https://lore.kernel.org/kvm/9e72261602bdab914cf7ff6f7cb921e35385136e.camel@intel.com/

So we can get rid of the per-VM DPAMT cache for S-EPT pages.
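
The two callbacks mentioned above would be something along the lines of
the below -- field and helper names are made up for illustration:

	struct kvm_mmu_memory_cache {
		/* ... existing fields ... */

		/*
		 * New: optional per-cache object allocator/free, e.g. backed
		 * by tdx_alloc_page()/tdx_free_page() for S-EPT pages.
		 */
		void *(*obj_alloc)(gfp_t gfp);
		void (*obj_free)(void *obj);
	};

	/* And in __kvm_mmu_topup_memory_cache(), roughly: */
	if (mc->obj_alloc)
		obj = mc->obj_alloc(gfp_flags);
	else
		obj = mmu_memory_cache_alloc_obj(mc, gfp_flags);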

- For DPAMT pages for the TDX guest private memory, I think we can also
get rid of the per-VM DPAMT cache if we use 'kvm_mmu_page' to carry the
needed DPAMT pages:

--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -111,6 +111,7 @@ struct kvm_mmu_page {
                 * Passed to TDX module, not accessed by KVM.
                 */
                void *external_spt;
+               void *leaf_level_private;
        };

Then we can define a structure which contains DPAMT pages for a given 2M
range:

	struct tdx_dpamt_metadata {
		struct page *page1;
		struct page *page2;
	};

Then when we allocate sp->external_spt, we can also allocate the DPAMT
pages for leaf_level_private via a kvm_x86_ops call when the 'sp' is
actually the last-level page table.

In this case, I think we can get rid of the per-VM DPAMT cache?
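
Something like the below is what I'm picturing, in the TDX callback that
allocates the external page table for a new sp (all names below are made
up, and the "last level" check is just illustrative):

	struct tdx_dpamt_metadata *meta;

	if (sp->role.level == PG_LEVEL_4K) {
		meta = kzalloc(sizeof(*meta), GFP_KERNEL_ACCOUNT);
		if (!meta)
			return -ENOMEM;

		/*
		 * Made-up helper: allocate the DPAMT pages for the 2M range
		 * that this last-level S-EPT page table will map.
		 */
		if (tdx_alloc_dpamt_pages(meta)) {
			kfree(meta);
			return -ENOMEM;
		}
		sp->leaf_level_private = meta;
	}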

For the fault path, similarly, I believe we can use a per-vCPU cache for
'struct tdx_dpamt_metadata' if we utilize the two new obj_alloc()/free()
hooks.
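
I.e. the per-vCPU side then just becomes something like the below (again,
the cache field and helpers are made-up names):

	/* At TDX vCPU creation, wire up the hypothetical hooks: */
	vcpu->arch.dpamt_meta_cache.obj_alloc = tdx_dpamt_meta_alloc;
	vcpu->arch.dpamt_meta_cache.obj_free = tdx_dpamt_meta_free;

	/* The fault path then tops it up like any other cache: */
	r = kvm_mmu_topup_memory_cache(&vcpu->arch.dpamt_meta_cache, 1);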

The cost is that the new 'leaf_level_private' takes an additional 8 bytes
for non-TDX guests even though it is never used, but if what I said above
is feasible, maybe it's worth the cost.

But it's completely possible that I missed something.  Any thoughts?

