Message-ID: <aYpjNrtGmogNzqwT@google.com>
Date: Mon, 9 Feb 2026 14:44:06 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Kai Huang <kai.huang@...el.com>
Cc: Yan Y Zhao <yan.y.zhao@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>, Xiaoyao Li <xiaoyao.li@...el.com>, 
	"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "kas@...nel.org" <kas@...nel.org>, 
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "mingo@...hat.com" <mingo@...hat.com>, 
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "ackerleytng@...gle.com" <ackerleytng@...gle.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Isaku Yamahata <isaku.yamahata@...el.com>, 
	"sagis@...gle.com" <sagis@...gle.com>, "tglx@...nel.org" <tglx@...nel.org>, 
	Rick P Edgecombe <rick.p.edgecombe@...el.com>, "bp@...en8.de" <bp@...en8.de>, 
	Vishal Annapurve <vannapurve@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH v5 20/45] KVM: x86/mmu: Allocate/free S-EPT pages
 using tdx_{alloc,free}_control_page()

On Mon, Feb 09, 2026, Kai Huang wrote:
> > Option #2 would be to immediately free the page in tdx_sept_reclaim_private_sp(),
> > so that pages that are freed via handle_removed_pt() don't defer freeing the S-EPT
> > page table (which, IIUC, is safe since the TDX-Module forces TLB flushes and exits).
> > 
> > I really, really don't like this option (if it even works).
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index ae7b9beb3249..4726011ad624 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -2014,7 +2014,15 @@ static void tdx_sept_reclaim_private_sp(struct kvm *kvm, gfn_t gfn,
> >          */
> >         if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
> >             tdx_reclaim_page(virt_to_page(sp->external_spt)))
> > -               sp->external_spt = NULL;
> > +               goto out;
> > +
> > +       /*
> > +        * Immediately free the control page, as the TDX subsystem doesn't
> > +        * support freeing pages from RCU callbacks.
> > +        */
> > +       tdx_free_control_page((unsigned long)sp->external_spt);
> > +out:
> > +       sp->external_spt = NULL;
> >  }
> >  
> >  void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
> 
> I don't think this is so bad, given we already have a bunch of
> 
> 	is_mirror_sp(sp)
> 		kvm_x86_call(xx_external_spt)(..);
> 
> in TDP MMU code?
> 
> I suppose this won't make a lot of difference, but does the below make you
> slightly happier?

Heh, slightly, but I still don't love it :-D

> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3181406c5e0b..3588265098a8 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -55,8 +55,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  
>  static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
>  {
> -       if (sp->external_spt)
> -       kvm_x86_call(free_external_sp)((unsigned long)sp->external_spt);
> +       WARN_ON_ONCE(sp->external_spt);

Doesn't work, because sp->external_spt will be non-NULL when KVM is freeing
unused pages in tdp_mmu_split_huge_pages_root() and kvm_tdp_mmu_map().  That's
solvable, but it's part of the asymmetry I don't love.  AFAICT, unless we do
something truly awful, there's no way to avoid having common KVM free unused
S-EPT pages.
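
E.g. paraphrasing the failure path in kvm_tdp_mmu_map() (shape only, not
verbatim; the real hunk is in the diff below):

	r = tdp_mmu_link_sp(kvm, &iter, sp, true);
	if (r) {
		/* Never linked, but sp->external_spt is still non-NULL. */
		tdp_mmu_free_sp(sp);
		goto retry;
	}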

That said, while I don't love the asymmetry, it's not a deal breaker, especially
if we make the asymmetry super obvious and cleanly delineated.  Specifically, we
can differentiate between freeing unused page tables and freeing used (linked at
any point) page tables.

This would also allow us to address the naming that Yan doesn't like around
reclaim_external_sp(), because we could have both free_external_spt() and
free_unused_external_spt(), where the lack of "unused" gives the reader a hint
that there's interesting work to be done for in-use external page tables.
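
In other words, something like this (sketch only; hook names and signatures
taken from the diff below):

	/* Allocated but never linked => plain free, no TDX side effects. */
	void (*free_unused_external_spt)(unsigned long external_spt);

	/*
	 * Linked at some point => the backend may need to do work first,
	 * e.g. TDX must reclaim the page from the TDX-Module.
	 */
	void (*free_external_spt)(struct kvm *kvm, gfn_t gfn,
				  struct kvm_mmu_page *sp);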

This won't apply cleanly due to other fixups.  It's also at:

  https://github.com/sean-jc/linux.git x86/tdx_huge_sept

---
 arch/x86/include/asm/kvm-x86-ops.h |  8 ++++----
 arch/x86/include/asm/kvm_host.h    | 10 +++++-----
 arch/x86/kvm/mmu/mmu.c             |  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c         | 27 +++++++++++++++++----------
 arch/x86/kvm/vmx/tdx.c             | 22 +++++++++++++++-------
 5 files changed, 43 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 87826176fa8d..9593d8d97f6b 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -94,11 +94,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
-KVM_X86_OP_OPTIONAL(alloc_external_sp)
-KVM_X86_OP_OPTIONAL(free_external_sp)
-KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
-KVM_X86_OP_OPTIONAL(reclaim_external_sp)
+KVM_X86_OP_OPTIONAL_RET0(alloc_external_spt)
+KVM_X86_OP_OPTIONAL(free_external_spt)
+KVM_X86_OP_OPTIONAL(free_unused_external_spt)
 KVM_X86_OP_OPTIONAL_RET0(topup_private_mapping_cache)
+KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7ff72c04d575..5fc25508548b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1855,13 +1855,13 @@ struct kvm_x86_ops {
 	 * and to propagate changes in mirror page tables to the external page
 	 * tables.
 	 */
-	unsigned long (*alloc_external_sp)(gfp_t gfp);
-	void (*free_external_sp)(unsigned long addr);
+	unsigned long (*alloc_external_spt)(gfp_t gfp);
+	void (*free_external_spt)(struct kvm *kvm, gfn_t gfn,
+				  struct kvm_mmu_page *sp);
+	void (*free_unused_external_spt)(unsigned long external_spt);
+	int (*topup_private_mapping_cache)(struct kvm *kvm, struct kvm_vcpu *vcpu);
 	int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 				 u64 new_spte, enum pg_level level);
-	void (*reclaim_external_sp)(struct kvm *kvm, gfn_t gfn,
-				    struct kvm_mmu_page *sp);
-	int (*topup_private_mapping_cache)(struct kvm *kvm, struct kvm_vcpu *vcpu);
 
 	bool (*has_wbinvd_exit)(void);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6a911aec075b..2ad417ac6e1f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6714,8 +6714,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	if (!vcpu->arch.mmu_shadow_page_cache.init_value)
 		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
-	vcpu->arch.mmu_external_spt_cache.page_get = kvm_x86_ops.alloc_external_sp;
-	vcpu->arch.mmu_external_spt_cache.page_free = kvm_x86_ops.free_external_sp;
+	vcpu->arch.mmu_external_spt_cache.page_get = kvm_x86_ops.alloc_external_spt;
+	vcpu->arch.mmu_external_spt_cache.page_free = kvm_x86_ops.free_unused_external_spt;
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b7c13316181b..d43db86b12a7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -53,12 +53,18 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	rcu_barrier();
 }
 
-static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
+static void __tdp_mmu_free_sp(struct kvm_mmu_page *sp)
+{
+	free_page((unsigned long)sp->spt);
+	kmem_cache_free(mmu_page_header_cache, sp);
+}
+
+static void tdp_mmu_free_unused_sp(struct kvm_mmu_page *sp)
 {
 	if (sp->external_spt)
-		kvm_x86_call(free_external_sp)((unsigned long)sp->external_spt);
-	free_page((unsigned long)sp->spt);
-	kmem_cache_free(mmu_page_header_cache, sp);
+		kvm_x86_call(free_unused_external_spt)((unsigned long)sp->external_spt);
+
+	__tdp_mmu_free_sp(sp);
 }
 
 /*
@@ -74,7 +80,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
 	struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
 					       rcu_head);
 
-	tdp_mmu_free_sp(sp);
+	WARN_ON_ONCE(sp->external_spt);
+	__tdp_mmu_free_sp(sp);
 }
 
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
@@ -458,7 +465,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 	}
 
 	if (is_mirror_sp(sp))
-		kvm_x86_call(reclaim_external_sp)(kvm, base_gfn, sp);
+		kvm_x86_call(free_external_spt)(kvm, base_gfn, sp);
 
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
@@ -1266,7 +1273,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * failed, e.g. because a different task modified the SPTE.
 		 */
 		if (r) {
-			tdp_mmu_free_sp(sp);
+			tdp_mmu_free_unused_sp(sp);
 			goto retry;
 		}
 
@@ -1461,7 +1468,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 		goto err_spt;
 
 	if (is_mirror_sp) {
-		sp->external_spt = (void *)kvm_x86_call(alloc_external_sp)(GFP_KERNEL_ACCOUNT);
+		sp->external_spt = (void *)kvm_x86_call(alloc_external_spt)(GFP_KERNEL_ACCOUNT);
 		if (!sp->external_spt)
 			goto err_external_spt;
 
@@ -1472,7 +1479,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	return sp;
 
 err_external_split:
-	kvm_x86_call(free_external_sp)((unsigned long)sp->external_spt);
+	kvm_x86_call(free_unused_external_spt)((unsigned long)sp->external_spt);
 err_external_spt:
 	free_page((unsigned long)sp->spt);
 err_spt:
@@ -1594,7 +1601,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 	 * installs its own sp in place of the last sp we tried to split.
 	 */
 	if (sp)
-		tdp_mmu_free_sp(sp);
+		tdp_mmu_free_unused_sp(sp);
 
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 957fa59e9a65..aae7af26fe02 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1994,8 +1994,8 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 	return tdx_sept_map_leaf_spte(kvm, gfn, new_spte, level);
 }
 
-static void tdx_sept_reclaim_private_sp(struct kvm *kvm, gfn_t gfn,
-					struct kvm_mmu_page *sp)
+static void tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+				      struct kvm_mmu_page *sp)
 {
 	/*
 	 * KVM doesn't (yet) zap page table pages in mirror page table while
@@ -2013,7 +2013,16 @@ static void tdx_sept_reclaim_private_sp(struct kvm *kvm, gfn_t gfn,
 	 */
 	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
 	    tdx_reclaim_page(virt_to_page(sp->external_spt)))
-		sp->external_spt = NULL;
+		goto out;
+
+	/*
+	 * Immediately free the S-EPT page as the TDX subsystem doesn't support
+	 * freeing pages from RCU callbacks, and more importantly because
+	 * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding readers.
+	 */
+	tdx_free_control_page((unsigned long)sp->external_spt);
+out:
+	sp->external_spt = NULL;
 }
 
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
@@ -3874,11 +3883,10 @@ void __init tdx_hardware_setup(void)
 	 * initialize the page (using the guest's encryption key), and (b) need
 	 * to use a custom allocator to be compatible with Dynamic PAMT.
 	 */
-	vt_x86_ops.alloc_external_sp = tdx_alloc_control_page;
-	vt_x86_ops.free_external_sp = tdx_free_control_page;
-
+	vt_x86_ops.alloc_external_spt = tdx_alloc_control_page;
+	vt_x86_ops.free_unused_external_spt = tdx_free_control_page;
+	vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
 	vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
-	vt_x86_ops.reclaim_external_sp = tdx_sept_reclaim_private_sp;
 
 	vt_x86_ops.gmem_convert = tdx_gmem_convert;
 

base-commit: 0191132f233a66b5cb1ae9a09c18c6d6669c284a
--
