[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYVPN5M7QQwu/r/n@yzhao56-desk.sh.intel.com>
Date: Fri, 6 Feb 2026 10:17:27 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Thomas Gleixner <tglx@...nel.org>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
<x86@...nel.org>, Kiryl Shutsemau <kas@...nel.org>, Paolo Bonzini
<pbonzini@...hat.com>, <linux-kernel@...r.kernel.org>,
<linux-coco@...ts.linux.dev>, <kvm@...r.kernel.org>, Kai Huang
<kai.huang@...el.com>, Rick Edgecombe <rick.p.edgecombe@...el.com>, "Vishal
Annapurve" <vannapurve@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>,
Sagi Shahar <sagis@...gle.com>, Binbin Wu <binbin.wu@...ux.intel.com>,
Xiaoyao Li <xiaoyao.li@...el.com>, Isaku Yamahata <isaku.yamahata@...el.com>
Subject: Re: [RFC PATCH v5 08/45] KVM: x86/mmu: Propagate mirror SPTE removal
to S-EPT in handle_changed_spte()
On Thu, Feb 05, 2026 at 02:33:16PM -0800, Sean Christopherson wrote:
> On Thu, Feb 05, 2026, Yan Zhao wrote:
> > On Wed, Feb 04, 2026 at 06:23:38PM -0800, Sean Christopherson wrote:
> > > On Wed, Feb 04, 2026, Yan Zhao wrote:
> > > > On Wed, Jan 28, 2026 at 05:14:40PM -0800, Sean Christopherson wrote:
> > > > > @@ -590,10 +566,21 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > > > > * the paging structure. Note the WARN on the PFN changing without the
> > > > > * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
> > > > > * pages are kernel allocations and should never be migrated.
> > > > > + *
> > > > > + * When removing leaf entries from a mirror, immediately propagate the
> > > > > + * changes to the external page tables. Note, non-leaf mirror entries
> > > > > + * are handled by handle_removed_pt(), as TDX requires that all leaf
> > > > > + * entries are removed before the owning page table. Note #2, writes
> > > > > + * to make mirror PTEs shadow-present are propagated to external page
> > > > > + * tables by __tdp_mmu_set_spte_atomic(), as KVM needs to ensure the
> > > > > + * external page table was successfully updated before marking the
> > > > > + * mirror SPTE present.
> > > > > */
> > > > > if (was_present && !was_leaf &&
> > > > > (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> > > > > handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> > > > > + else if (was_leaf && is_mirror_sptep(sptep) && !is_leaf)
> > > > Should we check !is_present instead of !is_leaf?
> > > > e.g. a transition from a present leaf entry to a present non-leaf entry could
> > > > also trigger this if case.
> > >
> > > No, the !is_leaf check is very intentional. At this point in the series, S-EPT
> > > doesn't support hugepages. If KVM manages to install a leaf SPTE and replaces
> > > that SPTE with a non-leaf SPTE, then we absolutely want the KVM_BUG_ON() in
> > > tdx_sept_remove_private_spte() to fire:
> > >
> > > /* TODO: handle large pages. */
> > > if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > > return -EIO;
> > But the op is named remove_external_spte().
> > And the check of "level != PG_LEVEL_4K" is for removing large leaf entries.
>
> I agree that the naming at this point in the series is unfortunate, but I don't
> see it as outright wrong. That the TDP MMU could theoretically replace the leaf
> SPTE with a non-leaf SPTE doesn't change the fact that the old leaf SPTE *is*
> being removed.
Hmm, I can't agree with that. But I won't insist if you think it's ok :)
> > Relying on this check is tricky and confusing.
>
> If it's still confusing at the end of the series, then I'm happy to discuss how
> we can make it less confusion. But as of this point in the series, I unfortunately
> don't see a better way to achieve my end goals (reducing the number of kvm_x86_ops
> hooks, and reducing how many TDX specific details bleed into common MMU code).
>
> There are "different" ways to incrementally move from where were at today, to where
> I want KVM to be, but I don't see them as "better". I.e. AFAICT, there's no way
> to move incrementally with reviewable patches while also maintaining perfect/ideal
> naming and flow.
Ok.
> > > And then later on, when S-EPT gains support for hugepages, "KVM: TDX: Add core
> > > support for splitting/demoting 2MiB S-EPT to 4KiB" doesn't need to touch code
> > > outside of arch/x86/kvm/vmx/tdx.c, because everything has already been plumbed
> > > in.
> > I haven't looked at the later patches for huge pages,
>
> Please do. As above, I don't think it's realistic to completely avoid some amount
> of "eww" in the intermediate stages.
WIP. Found some issues. Will comment after investigating further.
> > but plumbing here directly for splitting does not look right when it's
> > invoked under shared mmu_lock.
> > See the comment below.
> >
> > > > Besides, need "KVM_BUG_ON(shared, kvm)" in this case.
> > >
> > > Eh, we have lockdep_assert_held_write() in the S-EPT paths that require mmu_lock
> > > to be held for write. I don't think a KVM_BUG_ON() here would add meaningful
> > > value.
> > Hmm, I think KVM_BUG_ON(shared, kvm) is still useful.
> > If KVM invokes remove_external_spte() under shared mmu_lock, it needs to freeze
> > the entry first, similar to the sequence in __tdp_mmu_set_spte_atomic().
> >
> > i.e., invoking external x86 ops in handle_changed_spte() for mirror roots should
> > be !shared only.
>
> Sure, but...
>
> > Relying on the TDX code's lockdep_assert_held_write() for warning seems less
> > clear than having an explicit check here.
>
> ...that's TDX's responsibility to enforce, and I don't see any justification for
> something more than a lockdep assertion. As I've said elsewhere, several times,
My concern is that handle_changed_spte() can be invoked by callers other than
tdp_mmu_set_spte(). e.g.,
tdp_mmu_set_spte_atomic
| __tdp_mmu_set_spte_atomic
| | kvm_x86_call(set_external_spte)
| handle_changed_spte
| kvm_x86_call(set_external_spte)
When !is_frozen_spte(new_spte) and was_leaf, set_external_spte() may be invoked
twice for splitting under shared mmu_lock.
Therefore, I think it would be better to check for !shared and only invoke
set_external_spte() when !shared.
BTW: in the patch log, you mentioned that
: Invoke .remove_external_spte() in handle_changed_spte() as appropriate
: instead of relying on callers to do the right thing. Relying on callers
: to invoke .remove_external_spte() is confusing and brittle, e.g. subtly
: relies tdp_mmu_set_spte_atomic() never removing SPTEs, and removing an
: S-EPT entry in tdp_mmu_set_spte() is bizarre (yeah, the VM is bugged so
: it doesn't matter in practice, but it's still weird).
However, when tdp_mmu_set_spte_atomic() removes SPTEs in the future, it will
still need to follow the sequence in __tdp_mmu_set_spte_atomic() :
1. freeze, 2. set_external_spte(). 3. set the new_spte 4. handle_changed_spte().
So, do you think we should leave set_external_spte() in tdp_mmu_set_spte() for
exclusive mmu_lock scenarios instead of moving it to handle_changed_spte()?
> at some point we have to commit to getting the code right. Adding KVM_BUG_ON() in
> Every. Single. Call. does not yield more maintainable code. There are myriad
> things KVM can screw up, many of which have far, far more harmful impact than
> calling an S-EPT hook with mmu_lock held for read instead of write.
>
> The bar for adding a KVM_BUG_ON() is not simply "this shouldn't happen". It's,
> this shouldn't happen *and* at least one of (not a complete list):
>
> - we've either screwed this up badly more than once
> - it's really hard to get right
> - we might not notice if we do screw it up
> - KVM might corrupt data if we continue on
Thanks for sharing this bar.
Powered by blists - more mailing lists