[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYsOV7Q5FTWo+6/x@yzhao56-desk.sh.intel.com>
Date: Tue, 10 Feb 2026 18:54:15 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Thomas Gleixner <tglx@...nel.org>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
<x86@...nel.org>, Kiryl Shutsemau <kas@...nel.org>, Paolo Bonzini
<pbonzini@...hat.com>, <linux-kernel@...r.kernel.org>,
<linux-coco@...ts.linux.dev>, <kvm@...r.kernel.org>, Kai Huang
<kai.huang@...el.com>, Rick Edgecombe <rick.p.edgecombe@...el.com>, "Vishal
Annapurve" <vannapurve@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>,
Sagi Shahar <sagis@...gle.com>, Binbin Wu <binbin.wu@...ux.intel.com>,
Xiaoyao Li <xiaoyao.li@...el.com>, Isaku Yamahata <isaku.yamahata@...el.com>
Subject: Re: [RFC PATCH v5 08/45] KVM: x86/mmu: Propagate mirror SPTE removal
to S-EPT in handle_changed_spte()
On Fri, Feb 06, 2026 at 09:41:38AM -0800, Sean Christopherson wrote:
> On Fri, Feb 06, 2026, Yan Zhao wrote:
> > On Thu, Feb 05, 2026 at 02:33:16PM -0800, Sean Christopherson wrote:
> > > On Thu, Feb 05, 2026, Yan Zhao wrote:
> > > > > > > if (was_present && !was_leaf &&
> > > > > > > (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> > > > > > > handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> > > > > > > + else if (was_leaf && is_mirror_sptep(sptep) && !is_leaf)
> > > > > > Should we check !is_present instead of !is_leaf?
> > > > > > e.g. a transition from a present leaf entry to a present non-leaf entry could
> > > > > > also trigger this if case.
> > > > >
> > > > > No, the !is_leaf check is very intentional. At this point in the series, S-EPT
> > > > > doesn't support hugepages. If KVM manages to install a leaf SPTE and replaces
> > > > > that SPTE with a non-leaf SPTE, then we absolutely want the KVM_BUG_ON() in
> > > > > tdx_sept_remove_private_spte() to fire:
> > > > >
> > > > > /* TODO: handle large pages. */
> > > > > if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > > > > return -EIO;
> > > > But the op is named remove_external_spte().
> > > > And the check of "level != PG_LEVEL_4K" is for removing large leaf entries.
> > >
> > > I agree that the naming at this point in the series is unfortunate, but I don't
> > > see it as outright wrong. That the TDP MMU could theoretically replace the leaf
> > > SPTE with a non-leaf SPTE doesn't change the fact that the old leaf SPTE *is*
> > > being removed.
> > Hmm, I can't agree with that. But I won't insist if you think it's ok :)
>
> If the code is read through a TDX lens, then I agree, it's seems wrong. Because
> then you *know* that TDX doesn't support back-to-back remove()=>add() operations
> to handle a page split.
>
> But from a TDP MMU perspective, this is entirely logical (ignoring that
> link_external_spt() is gone at this point in the series).
>
> else if (was_leaf && is_mirror_sptep(sptep) && !is_leaf) {
> kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_spte);
>
> /*
> * Link the new page table if a hugepage is being split, i.e.
> * if a leaf SPTE is being replaced with a non-leaf SPTE.
> */
> if (is_present)
> kvm_x86_call(link_external_spt)(kvm, gfn, level, ...);
> }
>
> And that is *exactly* why I want to get rid of the hyper-specific kvm_x86_ops
> hooks. They bleed all kinds of implementation details all over the TDP MMU, which
> makes it difficult to read and understand the relevant TDP MMU code if you don't
> already know the TDX rules. And I absolutely do not want to effectively require
> others to understand TDX's rules to be able to make changes to the TDP MMU.
Ok. I can understand your reasoning of checking !is_leaf now.
Thanks for the explanation!
Though I still think checking !is_present before calling op remove_external_spte()
in this patch is better, I have no strong opinion :)
...
> Nope, as above, 100% the opposite. Over ~3 patches, e.g.
>
> 1. Drop the KVM_BUG_ON()s or move them to TDX
> 2. Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
> 3. Rework the code to rely on __handle_changed_spte() to propagate writes to S-EPT
>
> That way, the _only_ path that updates external SPTEs is this:
>
> if (was_present && !was_leaf &&
> (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> else if (is_mirror_sptep(sptep))
> return kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
> new_spte, level);
>
> which is fully aligned with handle_changed_spte()'s role for !mirror roots: it
> exists to react to changes (the sole exception to that being aging SPTEs, which
> is a special case).
>
> Compile-tested only.
LGTM overall.
> arch/x86/kvm/mmu/tdp_mmu.c | 118 ++++++++++++++++++-------------------
> 1 file changed, 59 insertions(+), 59 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 847f2fcb6740..33a321aedac0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -464,7 +464,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> }
>
> /**
> - * handle_changed_spte - handle bookkeeping associated with an SPTE change
> + * __handle_changed_spte - handle bookkeeping associated with an SPTE change
> * @kvm: kvm instance
> * @as_id: the address space of the paging structure the SPTE was a part of
> * @sptep: pointer to the SPTE
> @@ -480,9 +480,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> * dirty logging updates are handled in common code, not here (see make_spte()
> * and fast_pf_fix_direct_spte()).
> */
> -static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> - gfn_t gfn, u64 old_spte, u64 new_spte,
> - int level, bool shared)
> +static int __handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> + gfn_t gfn, u64 old_spte, u64 new_spte,
> + int level, bool shared)
> {
> bool was_present = is_shadow_present_pte(old_spte);
> bool is_present = is_shadow_present_pte(new_spte);
> @@ -518,7 +518,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> }
>
> if (old_spte == new_spte)
> - return;
> + return 0;
>
> trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
>
> @@ -547,7 +547,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> "a temporary frozen SPTE.\n"
> "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
> as_id, gfn, old_spte, new_spte, level);
> - return;
> + return 0;
> }
>
> if (is_leaf != was_leaf)
> @@ -559,30 +559,31 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
> * pages are kernel allocations and should never be migrated.
> *
> - * When modifying leaf entries in mirrored page tables, propagate the
> - * changes to the external SPTE. Bug the VM on failure, as callers
> - * aren't prepared to handle errors, e.g. due to lock contention in the
> - * TDX-Module. Note, changes to non-leaf mirror SPTEs are handled by
> - * handle_removed_pt() (the TDX-Module requires that child entries are
> - * removed before the parent SPTE), and changes to non-present mirror
> - * SPTEs are handled by __tdp_mmu_set_spte_atomic() (KVM needs to set
> - * the external SPTE while the mirror SPTE is frozen so that installing
> - * a new SPTE is effectively an atomic operation).
> + * When modifying leaf entries in mirrored page tables, propagate all
> + * changes to the external SPTE.
> */
> if (was_present && !was_leaf &&
> (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> - else if (was_leaf && is_mirror_sptep(sptep))
> - KVM_BUG_ON(kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
> - new_spte, level), kvm);
> + else if (is_mirror_sptep(sptep))
> + return kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
> + new_spte, level);
For TDX's future implementation of set_external_spte() for split splitting,
could we add a new param "bool shared" to op set_external_spte() in the
future? i.e.,
- when tdx_sept_split_private_spte() is invoked under write mmu_lock, it calls
tdh_do_no_vcpus() to retry BUSY error, and TDX_BUG_ON_2() then.
- when tdx_sept_split_private_spte() is invoked under read mmu_lock
(in the future when calling tdh_mem_range_block() in unnecessary), it could
directly return BUSY to TDP MMU on contention.
> + return 0;
> +}
> +
> +static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> + gfn_t gfn, u64 old_spte, u64 new_spte,
> + int level, bool shared)
> +{
Do we need "WARN_ON_ONCE(is_mirror_sptep(sptep) && shared)" here ?
> + KVM_BUG_ON(__handle_changed_spte(kvm, as_id, sptep, gfn, old_spte,
> + new_spte, level, shared), kvm);
> }
>
> static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> struct tdp_iter *iter,
> u64 new_spte)
> {
> - u64 *raw_sptep = rcu_dereference(iter->sptep);
> -
> /*
> * The caller is responsible for ensuring the old SPTE is not a FROZEN
> * SPTE. KVM should never attempt to zap or manipulate a FROZEN SPTE,
> @@ -591,40 +592,6 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> */
> WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
>
> - if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
> - int ret;
> -
> - /*
> - * KVM doesn't currently support zapping or splitting mirror
> - * SPTEs while holding mmu_lock for read.
> - */
> - if (KVM_BUG_ON(is_shadow_present_pte(iter->old_spte), kvm) ||
> - KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
> - return -EBUSY;
> -
> - /*
> - * Temporarily freeze the SPTE until the external PTE operation
> - * has completed, e.g. so that concurrent faults don't attempt
> - * to install a child PTE in the external page table before the
> - * parent PTE has been written.
> - */
> - if (!try_cmpxchg64(raw_sptep, &iter->old_spte, FROZEN_SPTE))
> - return -EBUSY;
> -
> - /*
> - * Update the external PTE. On success, set the mirror SPTE to
> - * the desired value. On failure, restore the old SPTE so that
> - * the SPTE isn't frozen in perpetuity.
> - */
> - ret = kvm_x86_call(set_external_spte)(kvm, iter->gfn, iter->old_spte,
> - new_spte, iter->level);
> - if (ret)
> - __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> - else
> - __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> - return ret;
> - }
> -
> /*
> * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> * does not hold the mmu_lock. On failure, i.e. if a different logical
> @@ -632,7 +599,7 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> * the current value, so the caller operates on fresh data, e.g. if it
> * retries tdp_mmu_set_spte_atomic()
> */
> - if (!try_cmpxchg64(raw_sptep, &iter->old_spte, new_spte))
> + if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte, new_spte))
> return -EBUSY;
>
> return 0;
> @@ -663,14 +630,44 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
>
> + /* KVM should never freeze SPTEs using higher level APIs. */
> + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
What about
KVM_MMU_WARN_ON(is_frozen_spte(new_spte) ||
is_frozen_spte(iter->old_spte) || iter->yielded);
> + /*
> + * Temporarily freeze the SPTE until the external PTE operation has
> + * completed (unless the new SPTE itself will be frozen), e.g. so that
> + * concurrent faults don't attempt to install a child PTE in the
> + * external page table before the parent PTE has been written, or try
> + * to re-install a page table before the old one was removed.
> + */
> + if (is_mirror_sptep(iter->sptep))
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> + else
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
and invoking open code try_cmpxchg64() directly?
> if (ret)
> return ret;
>
> - handle_changed_spte(kvm, iter->as_id, iter->sptep, iter->gfn,
> - iter->old_spte, new_spte, iter->level, true);
> + ret = __handle_changed_spte(kvm, iter->as_id, iter->sptep, iter->gfn,
> + iter->old_spte, new_spte, iter->level, true);
>
> - return 0;
> + /*
> + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> + * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> + * otherwise set the mirror SPTE to the new desired value.
> + */
> + if (is_mirror_sptep(iter->sptep)) {
> + if (ret)
> + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> + else
> + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> + } else {
> + /*
> + * Bug the VM if handling the change failed, as failure is only
> + * allowed if KVM couldn't update the external SPTE.
> + */
> + KVM_BUG_ON(ret, kvm);
> + }
> + return ret;
> }
One concern for tdp_mmu_set_spte_atomic() to handle mirror SPTEs:
- Previously
1. set *iter->sptep to FROZEN_SPTE.
2. kvm_x86_call(set_external_spte)(old_spte, new_spte)
3. set *iter->sptep to new_spte
- Now with this diff
1. set *iter->sptep to FROZEN_SPTE.
2. __handle_changed_spte()
--> kvm_x86_call(set_external_spte)(iter->sptep, old_spte, new_spte)
3. set *iter->sptep to new_spte
what if __handle_changed_spte() reads *iter->sptep in step 2?
Passing in "bool is_mirror_sp" to __handle_changed_spte() instead?
Powered by blists - more mailing lists