linux-kernel - Re: [RFC PATCH v5 08/45] KVM: x86/mmu: Propagate mirror SPTE removal to S-EPT in handle_changed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYVPN5M7QQwu/r/n@yzhao56-desk.sh.intel.com>
Date: Fri, 6 Feb 2026 10:17:27 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Thomas Gleixner <tglx@...nel.org>, Ingo Molnar <mingo@...hat.com>,
	Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
	<x86@...nel.org>, Kiryl Shutsemau <kas@...nel.org>, Paolo Bonzini
	<pbonzini@...hat.com>, <linux-kernel@...r.kernel.org>,
	<linux-coco@...ts.linux.dev>, <kvm@...r.kernel.org>, Kai Huang
	<kai.huang@...el.com>, Rick Edgecombe <rick.p.edgecombe@...el.com>, "Vishal
 Annapurve" <vannapurve@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>,
	Sagi Shahar <sagis@...gle.com>, Binbin Wu <binbin.wu@...ux.intel.com>,
	Xiaoyao Li <xiaoyao.li@...el.com>, Isaku Yamahata <isaku.yamahata@...el.com>
Subject: Re: [RFC PATCH v5 08/45] KVM: x86/mmu: Propagate mirror SPTE removal
 to S-EPT in handle_changed_spte()

On Thu, Feb 05, 2026 at 02:33:16PM -0800, Sean Christopherson wrote:
> On Thu, Feb 05, 2026, Yan Zhao wrote:
> > On Wed, Feb 04, 2026 at 06:23:38PM -0800, Sean Christopherson wrote:
> > > On Wed, Feb 04, 2026, Yan Zhao wrote:
> > > > On Wed, Jan 28, 2026 at 05:14:40PM -0800, Sean Christopherson wrote:
> > > > > @@ -590,10 +566,21 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > > > >  	 * the paging structure.  Note the WARN on the PFN changing without the
> > > > >  	 * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
> > > > >  	 * pages are kernel allocations and should never be migrated.
> > > > > +	 *
> > > > > +	 * When removing leaf entries from a mirror, immediately propagate the
> > > > > +	 * changes to the external page tables.  Note, non-leaf mirror entries
> > > > > +	 * are handled by handle_removed_pt(), as TDX requires that all leaf
> > > > > +	 * entries are removed before the owning page table.  Note #2, writes
> > > > > +	 * to make mirror PTEs shadow-present are propagated to external page
> > > > > +	 * tables by __tdp_mmu_set_spte_atomic(), as KVM needs to ensure the
> > > > > +	 * external page table was successfully updated before marking the
> > > > > +	 * mirror SPTE present.
> > > > >  	 */
> > > > >  	if (was_present && !was_leaf &&
> > > > >  	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> > > > >  		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> > > > > +	else if (was_leaf && is_mirror_sptep(sptep) && !is_leaf)
> > > > Should we check !is_present instead of !is_leaf?
> > > > e.g. a transition from a present leaf entry to a present non-leaf entry could
> > > > also trigger this if case.
> > > 
> > > No, the !is_leaf check is very intentional.  At this point in the series, S-EPT
> > > doesn't support hugepages.  If KVM manages to install a leaf SPTE and replaces
> > > that SPTE with a non-leaf SPTE, then we absolutely want the KVM_BUG_ON() in
> > > tdx_sept_remove_private_spte() to fire:
> > > 
> > > 	/* TODO: handle large pages. */
> > > 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > > 		return -EIO;
> > But the op is named remove_external_spte().
> > And the check of "level != PG_LEVEL_4K" is for removing large leaf entries.
> 
> I agree that the naming at this point in the series is unfortunate, but I don't
> see it as outright wrong.  That the TDP MMU could theoretically replace the leaf
> SPTE with a non-leaf SPTE doesn't change the fact that the old leaf SPTE *is*
> being removed.
Hmm, I can't agree with that. But I won't insist if you think it's ok :)


> > Relying on this check is tricky and confusing.
> 
> If it's still confusing at the end of the series, then I'm happy to discuss how
> we can make it less confusion.  But as of this point in the series, I unfortunately
> don't see a better way to achieve my end goals (reducing the number of kvm_x86_ops
> hooks, and reducing how many TDX specific details bleed into common MMU code).
> 
> There are "different" ways to incrementally move from where were at today, to where
> I want KVM to be, but I don't see them as "better".  I.e. AFAICT, there's no way
> to move incrementally with reviewable patches while also maintaining perfect/ideal
> naming and flow.
Ok.

> > > And then later on, when S-EPT gains support for hugepages, "KVM: TDX: Add core
> > > support for splitting/demoting 2MiB S-EPT to 4KiB" doesn't need to touch code
> > > outside of arch/x86/kvm/vmx/tdx.c, because everything has already been plumbed
> > > in.
> > I haven't looked at the later patches for huge pages,
> 
> Please do.  As above, I don't think it's realistic to completely avoid some amount
> of "eww" in the intermediate stages.
WIP. Found some issues. Will comment after investigating further.

> > but plumbing here directly for splitting does not look right when it's
> > invoked under shared mmu_lock.
> > See the comment below.
> >  
> > > > Besides, need "KVM_BUG_ON(shared, kvm)" in this case.
> > > 
> > > Eh, we have lockdep_assert_held_write() in the S-EPT paths that require mmu_lock
> > > to be held for write.  I don't think a KVM_BUG_ON() here would add meaningful
> > > value.
> > Hmm, I think KVM_BUG_ON(shared, kvm) is still useful.
> > If KVM invokes remove_external_spte() under shared mmu_lock, it needs to freeze
> > the entry first, similar to the sequence in __tdp_mmu_set_spte_atomic().
> > 
> > i.e., invoking external x86 ops in handle_changed_spte() for mirror roots should
> > be !shared only.
> 
> Sure, but...
> 
> > Relying on the TDX code's lockdep_assert_held_write() for warning seems less
> > clear than having an explicit check here.
> 
> ...that's TDX's responsibility to enforce, and I don't see any justification for
> something more than a lockdep assertion.  As I've said elsewhere, several times,
My concern is that handle_changed_spte() can be invoked by callers other than
tdp_mmu_set_spte(). e.g.,

tdp_mmu_set_spte_atomic
  | __tdp_mmu_set_spte_atomic
  |     | kvm_x86_call(set_external_spte)
  | handle_changed_spte
        | kvm_x86_call(set_external_spte)

When !is_frozen_spte(new_spte) and was_leaf, set_external_spte() may be invoked
twice for splitting under shared mmu_lock.

Therefore, I think it would be better to check for !shared and only invoke
set_external_spte() when !shared.

BTW: in the patch log, you mentioned that

: Invoke .remove_external_spte() in handle_changed_spte() as appropriate
: instead of relying on callers to do the right thing.  Relying on callers
: to invoke .remove_external_spte() is confusing and brittle, e.g. subtly
: relies tdp_mmu_set_spte_atomic() never removing SPTEs, and removing an
: S-EPT entry in tdp_mmu_set_spte() is bizarre (yeah, the VM is bugged so
: it doesn't matter in practice, but it's still weird).

However, when tdp_mmu_set_spte_atomic() removes SPTEs in the future, it will
still need to follow the sequence in __tdp_mmu_set_spte_atomic() :
1. freeze, 2. set_external_spte(). 3. set the new_spte 4. handle_changed_spte().

So, do you think we should leave set_external_spte() in tdp_mmu_set_spte() for
exclusive mmu_lock scenarios instead of moving it to handle_changed_spte()?

> at some point we have to commit to getting the code right.  Adding KVM_BUG_ON() in
> Every. Single. Call. does not yield more maintainable code.  There are myriad
> things KVM can screw up, many of which have far, far more harmful impact than
> calling an S-EPT hook with mmu_lock held for read instead of write.
> 
> The bar for adding a KVM_BUG_ON() is not simply "this shouldn't happen".  It's,
> this shouldn't happen *and* at least one of (not a complete list):
> 
>   - we've either screwed this up badly more than once
>   - it's really hard to get right
>   - we might not notice if we do screw it up
>   - KVM might corrupt data if we continue on
Thanks for sharing this bar.