linux-kernel - Re: [RFC PATCH v5 08/45] KVM: x86/mmu: Propagate mirror SPTE removal to S-EPT in handle_changed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYsOV7Q5FTWo+6/x@yzhao56-desk.sh.intel.com>
Date: Tue, 10 Feb 2026 18:54:15 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Thomas Gleixner <tglx@...nel.org>, Ingo Molnar <mingo@...hat.com>,
	Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
	<x86@...nel.org>, Kiryl Shutsemau <kas@...nel.org>, Paolo Bonzini
	<pbonzini@...hat.com>, <linux-kernel@...r.kernel.org>,
	<linux-coco@...ts.linux.dev>, <kvm@...r.kernel.org>, Kai Huang
	<kai.huang@...el.com>, Rick Edgecombe <rick.p.edgecombe@...el.com>, "Vishal
 Annapurve" <vannapurve@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>,
	Sagi Shahar <sagis@...gle.com>, Binbin Wu <binbin.wu@...ux.intel.com>,
	Xiaoyao Li <xiaoyao.li@...el.com>, Isaku Yamahata <isaku.yamahata@...el.com>
Subject: Re: [RFC PATCH v5 08/45] KVM: x86/mmu: Propagate mirror SPTE removal
 to S-EPT in handle_changed_spte()

On Fri, Feb 06, 2026 at 09:41:38AM -0800, Sean Christopherson wrote:
> On Fri, Feb 06, 2026, Yan Zhao wrote:
> > On Thu, Feb 05, 2026 at 02:33:16PM -0800, Sean Christopherson wrote:
> > > On Thu, Feb 05, 2026, Yan Zhao wrote:
> > > > > > >  	if (was_present && !was_leaf &&
> > > > > > >  	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> > > > > > >  		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> > > > > > > +	else if (was_leaf && is_mirror_sptep(sptep) && !is_leaf)
> > > > > > Should we check !is_present instead of !is_leaf?
> > > > > > e.g. a transition from a present leaf entry to a present non-leaf entry could
> > > > > > also trigger this if case.
> > > > > 
> > > > > No, the !is_leaf check is very intentional.  At this point in the series, S-EPT
> > > > > doesn't support hugepages.  If KVM manages to install a leaf SPTE and replaces
> > > > > that SPTE with a non-leaf SPTE, then we absolutely want the KVM_BUG_ON() in
> > > > > tdx_sept_remove_private_spte() to fire:
> > > > > 
> > > > > 	/* TODO: handle large pages. */
> > > > > 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > > > > 		return -EIO;
> > > > But the op is named remove_external_spte().
> > > > And the check of "level != PG_LEVEL_4K" is for removing large leaf entries.
> > > 
> > > I agree that the naming at this point in the series is unfortunate, but I don't
> > > see it as outright wrong.  That the TDP MMU could theoretically replace the leaf
> > > SPTE with a non-leaf SPTE doesn't change the fact that the old leaf SPTE *is*
> > > being removed.
> > Hmm, I can't agree with that. But I won't insist if you think it's ok :)
> 
> If the code is read through a TDX lens, then I agree, it's seems wrong.  Because
> then you *know* that TDX doesn't support back-to-back remove()=>add() operations
> to handle a page split.
> 
> But from a TDP MMU perspective, this is entirely logical (ignoring that
> link_external_spt() is gone at this point in the series).
> 
> 	else if (was_leaf && is_mirror_sptep(sptep) && !is_leaf) {
> 		kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_spte);
> 
> 		/*
> 		 * Link the new page table if a hugepage is being split, i.e.
> 		 * if a leaf SPTE is being replaced with a non-leaf SPTE.
> 		 */
> 		if (is_present)
> 			kvm_x86_call(link_external_spt)(kvm, gfn, level, ...);
> 	}
>
> And that is *exactly* why I want to get rid of the hyper-specific kvm_x86_ops
> hooks.  They bleed all kinds of implementation details all over the TDP MMU, which
> makes it difficult to read and understand the relevant TDP MMU code if you don't
> already know the TDX rules.  And I absolutely do not want to effectively require
> others to understand TDX's rules to be able to make changes to the TDP MMU.
Ok. I can understand your reasoning of checking !is_leaf now.
Thanks for the explanation!

Though I still think checking !is_present before calling op remove_external_spte()
in this patch is better, I have no strong opinion :)

... 
> Nope, as above, 100% the opposite.  Over ~3 patches, e.g.
> 
>  1. Drop the KVM_BUG_ON()s or move them to TDX
>  2. Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
>  3. Rework the code to rely on __handle_changed_spte() to propagate writes to S-EPT
> 
> That way, the _only_ path that updates external SPTEs is this:
> 
> 	if (was_present && !was_leaf &&
> 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> 	else if (is_mirror_sptep(sptep))
> 		return kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
> 						       new_spte, level);
> 
> which is fully aligned with handle_changed_spte()'s role for !mirror roots: it
> exists to react to changes (the sole exception to that being aging SPTEs, which
> is a special case).
> 
> Compile-tested only.
LGTM overall.

>  arch/x86/kvm/mmu/tdp_mmu.c | 118 ++++++++++++++++++-------------------
>  1 file changed, 59 insertions(+), 59 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 847f2fcb6740..33a321aedac0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -464,7 +464,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>  }
>  
>  /**
> - * handle_changed_spte - handle bookkeeping associated with an SPTE change
> + * __handle_changed_spte - handle bookkeeping associated with an SPTE change
>   * @kvm: kvm instance
>   * @as_id: the address space of the paging structure the SPTE was a part of
>   * @sptep: pointer to the SPTE
> @@ -480,9 +480,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>   * dirty logging updates are handled in common code, not here (see make_spte()
>   * and fast_pf_fix_direct_spte()).
>   */
> -static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> -				gfn_t gfn, u64 old_spte, u64 new_spte,
> -				int level, bool shared)
> +static int __handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> +				 gfn_t gfn, u64 old_spte, u64 new_spte,
> +				 int level, bool shared)
>  {
>  	bool was_present = is_shadow_present_pte(old_spte);
>  	bool is_present = is_shadow_present_pte(new_spte);
> @@ -518,7 +518,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>  	}
>  
>  	if (old_spte == new_spte)
> -		return;
> +		return 0;
>  
>  	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
>  
> @@ -547,7 +547,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>  			       "a temporary frozen SPTE.\n"
>  			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
>  			       as_id, gfn, old_spte, new_spte, level);
> -		return;
> +		return 0;
>  	}
>  
>  	if (is_leaf != was_leaf)
> @@ -559,30 +559,31 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>  	 * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
>  	 * pages are kernel allocations and should never be migrated.
>  	 *
> -	 * When modifying leaf entries in mirrored page tables, propagate the
> -	 * changes to the external SPTE.  Bug the VM on failure, as callers
> -	 * aren't prepared to handle errors, e.g. due to lock contention in the
> -	 * TDX-Module.  Note, changes to non-leaf mirror SPTEs are handled by
> -	 * handle_removed_pt() (the TDX-Module requires that child entries are
> -	 * removed before the parent SPTE), and changes to non-present mirror
> -	 * SPTEs are handled by __tdp_mmu_set_spte_atomic() (KVM needs to set
> -	 * the external SPTE while the mirror SPTE is frozen so that installing
> -	 * a new SPTE is effectively an atomic operation).
> +	 * When modifying leaf entries in mirrored page tables, propagate all
> +	 * changes to the external SPTE.
>  	 */
>  	if (was_present && !was_leaf &&
>  	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
>  		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> -	else if (was_leaf && is_mirror_sptep(sptep))
> -		KVM_BUG_ON(kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
> -							   new_spte, level), kvm);
> +	else if (is_mirror_sptep(sptep))
> +		return kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
> +						       new_spte, level);
For TDX's future implementation of set_external_spte() for split splitting,
could we add a new param "bool shared" to op set_external_spte() in the
future? i.e.,
- when tdx_sept_split_private_spte() is invoked under write mmu_lock, it calls
  tdh_do_no_vcpus() to retry BUSY error, and TDX_BUG_ON_2() then.
- when tdx_sept_split_private_spte() is invoked under read mmu_lock
  (in the future when calling tdh_mem_range_block() in unnecessary), it could
  directly return BUSY to TDP MMU on contention.


> +	return 0;
> +}
> +
> +static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> +				gfn_t gfn, u64 old_spte, u64 new_spte,
> +				int level, bool shared)
> +{
Do we need "WARN_ON_ONCE(is_mirror_sptep(sptep) && shared)" here ? 

> +	KVM_BUG_ON(__handle_changed_spte(kvm, as_id, sptep, gfn, old_spte,
> +					 new_spte, level, shared), kvm);
>  }



>  
>  static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  							 struct tdp_iter *iter,
>  							 u64 new_spte)
>  {
> -	u64 *raw_sptep = rcu_dereference(iter->sptep);
> -
>  	/*
>  	 * The caller is responsible for ensuring the old SPTE is not a FROZEN
>  	 * SPTE.  KVM should never attempt to zap or manipulate a FROZEN SPTE,
> @@ -591,40 +592,6 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  	 */
>  	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
>  
> -	if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
> -		int ret;
> -
> -		/*
> -		 * KVM doesn't currently support zapping or splitting mirror
> -		 * SPTEs while holding mmu_lock for read.
> -		 */
> -		if (KVM_BUG_ON(is_shadow_present_pte(iter->old_spte), kvm) ||
> -		    KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
> -			return -EBUSY;
> -
> -		/*
> -		 * Temporarily freeze the SPTE until the external PTE operation
> -		 * has completed, e.g. so that concurrent faults don't attempt
> -		 * to install a child PTE in the external page table before the
> -		 * parent PTE has been written.
> -		 */
> -		if (!try_cmpxchg64(raw_sptep, &iter->old_spte, FROZEN_SPTE))
> -			return -EBUSY;
> -
> -		/*
> -		 * Update the external PTE.  On success, set the mirror SPTE to
> -		 * the desired value.  On failure, restore the old SPTE so that
> -		 * the SPTE isn't frozen in perpetuity.
> -		 */
> -		ret = kvm_x86_call(set_external_spte)(kvm, iter->gfn, iter->old_spte,
> -						      new_spte, iter->level);
> -		if (ret)
> -			__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> -		else
> -			__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> -		return ret;
> -	}
> -
>  	/*
>  	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
>  	 * does not hold the mmu_lock.  On failure, i.e. if a different logical
> @@ -632,7 +599,7 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  	 * the current value, so the caller operates on fresh data, e.g. if it
>  	 * retries tdp_mmu_set_spte_atomic()
>  	 */
> -	if (!try_cmpxchg64(raw_sptep, &iter->old_spte, new_spte))
> +	if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte, new_spte))
>  		return -EBUSY;
>  
>  	return 0;
> @@ -663,14 +630,44 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  
>  	lockdep_assert_held_read(&kvm->mmu_lock);
>  
> -	ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
>
> +	/* KVM should never freeze SPTEs using higher level APIs. */
> +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
What about
	KVM_MMU_WARN_ON(is_frozen_spte(new_spte) ||
			is_frozen_spte(iter->old_spte) || iter->yielded);

> +	/*
> +	  * Temporarily freeze the SPTE until the external PTE operation has
> +	  * completed (unless the new SPTE itself will be frozen), e.g. so that
> +	  * concurrent faults don't attempt to install a child PTE in the
> +	  * external page table before the parent PTE has been written, or try
> +	  * to re-install a page table before the old one was removed.
> +	  */
> +	if (is_mirror_sptep(iter->sptep))
> +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> +	else
> +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
and invoking open code try_cmpxchg64() directly?

>  	if (ret)
>  		return ret;
>  
> -	handle_changed_spte(kvm, iter->as_id, iter->sptep, iter->gfn,
> -			    iter->old_spte, new_spte, iter->level, true);
> +	ret = __handle_changed_spte(kvm, iter->as_id, iter->sptep, iter->gfn,
> +				    iter->old_spte, new_spte, iter->level, true);
>  
> -	return 0;
> +	/*
> +	 * Unfreeze the mirror SPTE.  If updating the external SPTE failed,
> +	 * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> +	 * otherwise set the mirror SPTE to the new desired value.
> +	 */
> +	if (is_mirror_sptep(iter->sptep)) {
> +		if (ret)
> +			__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> +		else
> +			__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> +	} else {
> +		/*
> +		 * Bug the VM if handling the change failed, as failure is only
> +		 * allowed if KVM couldn't update the external SPTE.
> +		 */
> +		KVM_BUG_ON(ret, kvm);
> +	}
> +	return ret;
>  }
One concern for tdp_mmu_set_spte_atomic() to handle mirror SPTEs:
- Previously
  1. set *iter->sptep to FROZEN_SPTE.
  2. kvm_x86_call(set_external_spte)(old_spte, new_spte)
  3. set *iter->sptep to new_spte

- Now with this diff
  1. set *iter->sptep to FROZEN_SPTE.
  2. __handle_changed_spte()
     --> kvm_x86_call(set_external_spte)(iter->sptep, old_spte, new_spte)
  3. set *iter->sptep to new_spte 

  what if __handle_changed_spte() reads *iter->sptep in step 2?
  Passing in "bool is_mirror_sp" to __handle_changed_spte() instead?