[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYvmlBb6oR3lfWn2@yzhao56-desk.sh.intel.com>
Date: Wed, 11 Feb 2026 10:16:52 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Thomas Gleixner <tglx@...nel.org>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
<x86@...nel.org>, Kiryl Shutsemau <kas@...nel.org>, Paolo Bonzini
<pbonzini@...hat.com>, <linux-kernel@...r.kernel.org>,
<linux-coco@...ts.linux.dev>, <kvm@...r.kernel.org>, Kai Huang
<kai.huang@...el.com>, Rick Edgecombe <rick.p.edgecombe@...el.com>, "Vishal
Annapurve" <vannapurve@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>,
Sagi Shahar <sagis@...gle.com>, Binbin Wu <binbin.wu@...ux.intel.com>,
Xiaoyao Li <xiaoyao.li@...el.com>, Isaku Yamahata <isaku.yamahata@...el.com>
Subject: Re: [RFC PATCH v5 08/45] KVM: x86/mmu: Propagate mirror SPTE removal
to S-EPT in handle_changed_spte()
On Tue, Feb 10, 2026 at 11:52:09AM -0800, Sean Christopherson wrote:
> > For TDX's future implementation of set_external_spte() for split splitting,
> > could we add a new param "bool shared" to op set_external_spte() in the
> > future? i.e.,
> > - when tdx_sept_split_private_spte() is invoked under write mmu_lock, it calls
> > tdh_do_no_vcpus() to retry BUSY error, and TDX_BUG_ON_2() then.
> > - when tdx_sept_split_private_spte() is invoked under read mmu_lock
> > (in the future when calling tdh_mem_range_block() in unnecessary), it could
> > directly return BUSY to TDP MMU on contention.
>
> Yeah, I have no objection to using @shared for things like that.
Great.
> > > + return 0;
> > > +}
> > > +
> > > +static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > > + gfn_t gfn, u64 old_spte, u64 new_spte,
> > > + int level, bool shared)
> > > +{
> > Do we need "WARN_ON_ONCE(is_mirror_sptep(sptep) && shared)" here ?
>
> No, because I want to call this code for all paths, including the fault path.
Hmm. IIUC, handle_changed_spte() can't be invoked for mirror root under read
mmu_lock.
For read mmu_lock + mirror scenarios, they need to invoke
tdp_mmu_set_spte_atomic() --> __handle_changed_spte().
> > > @@ -663,14 +630,44 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > >
> > > lockdep_assert_held_read(&kvm->mmu_lock);
> > >
> > > - ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > >
> > > + /* KVM should never freeze SPTEs using higher level APIs. */
> > > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > What about
> > KVM_MMU_WARN_ON(is_frozen_spte(new_spte) ||
> > is_frozen_spte(iter->old_spte) || iter->yielded);
> >
> > > + /*
> > > + * Temporarily freeze the SPTE until the external PTE operation has
> > > + * completed (unless the new SPTE itself will be frozen), e.g. so that
> > > + * concurrent faults don't attempt to install a child PTE in the
> > > + * external page table before the parent PTE has been written, or try
> > > + * to re-install a page table before the old one was removed.
> > > + */
> > > + if (is_mirror_sptep(iter->sptep))
> > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > > + else
> > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > and invoking open code try_cmpxchg64() directly?
>
> No, because __tdp_mmu_set_spte_atomic() is still used by kvm_tdp_mmu_age_spte(),
> and the yielded/frozen rules apply there as well.
I see.
> > > + /*
> > > + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> > > + * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> > > + * otherwise set the mirror SPTE to the new desired value.
> > > + */
> > > + if (is_mirror_sptep(iter->sptep)) {
> > > + if (ret)
> > > + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> > > + else
> > > + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> > > + } else {
> > > + /*
> > > + * Bug the VM if handling the change failed, as failure is only
> > > + * allowed if KVM couldn't update the external SPTE.
> > > + */
> > > + KVM_BUG_ON(ret, kvm);
> > > + }
> > > + return ret;
> > > }
> > One concern for tdp_mmu_set_spte_atomic() to handle mirror SPTEs:
> > - Previously
> > 1. set *iter->sptep to FROZEN_SPTE.
> > 2. kvm_x86_call(set_external_spte)(old_spte, new_spte)
> > 3. set *iter->sptep to new_spte
> >
> > - Now with this diff
> > 1. set *iter->sptep to FROZEN_SPTE.
> > 2. __handle_changed_spte()
> > --> kvm_x86_call(set_external_spte)(iter->sptep, old_spte, new_spte)
>
> Note, iter->sptep isn't passed to set_external_spte(), the invocation for that is:
>
> return kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
> new_spte, level);
Oh, sorry. It should be
2. __handle_changed_spte(iter->sptep, old_spte, new_spte)
--> kvm_x86_call(set_external_spte)(old_spte, new_spte)
My concern is that what we pass to __handle_changed_spte() here are
"iter->sptep, iter->old_spte, new_spte".
ret = __handle_changed_spte(kvm, iter->as_id, iter->sptep, iter->gfn,
iter->old_spte, new_spte, iter->level, true);
i.e., new_spte is the target value which we haven't written it to iter->sptep
yet. We'll write the target value new_spte to iter->sptep after
__handle_changed_spte() succeeds. So, upon invoking __handle_changed_spte() in
tdp_mmu_set_spte_atomic(), iter->sptep just holds value FROZEN_SPTE.
So, re-reading iter->sptep will get a different value (which is FROZEN_SPTE)
from the new_spte passed to __handle_changed_spte().
Besides, __handle_changed_spte() contains code like
"kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);", which may have
incorrectly updated the stats even if kvm_x86_call(set_external_spte)() fails
later and the new_spte is never written to iter->sptep.
> > 3. set *iter->sptep to new_spte
> >
> > what if __handle_changed_spte() reads *iter->sptep in step 2?
>
> For the most part, "don't do that". There are an infinite number of "what ifs".
> I agree that re-reading iter->sptep is slightly more likely than other "what ifs",
> but then if we convert to a boolean it creates the "what if we swap the order of
> @as_id and @is_mirror_sp"? Given that @old_spte is provided, IMO re-reading the
> SPTE from memory will stand out.
As my above concern, re-reading SPTE in __handle_changed_spte() will just get
value FROZEN_SPTE instead of the value of new_spte.
> That said, I think we can have the best of both worlds. Rather than pass @as_id
> and @sptep, pass the @sp, i.e. the owning kvm_mmu_page. That would address your
> concern about re-reading the sptep, without needing another boolean.
Hmm, my intention of passing boolean is to avoid re-reading sptep, because
in step 2, we pass new_spte instead of the real value in sptep (which is
FROZEN_SPTE for mirror sp) to __handle_changed_spte().
So, passing @sp may not help?
Powered by blists - more mailing lists