lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYUarHf3KEwHGuJe@google.com>
Date: Thu, 5 Feb 2026 14:33:16 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Thomas Gleixner <tglx@...nel.org>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	Kiryl Shutsemau <kas@...nel.org>, Paolo Bonzini <pbonzini@...hat.com>, linux-kernel@...r.kernel.org, 
	linux-coco@...ts.linux.dev, kvm@...r.kernel.org, 
	Kai Huang <kai.huang@...el.com>, Rick Edgecombe <rick.p.edgecombe@...el.com>, 
	Vishal Annapurve <vannapurve@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>, 
	Sagi Shahar <sagis@...gle.com>, Binbin Wu <binbin.wu@...ux.intel.com>, 
	Xiaoyao Li <xiaoyao.li@...el.com>, Isaku Yamahata <isaku.yamahata@...el.com>
Subject: Re: [RFC PATCH v5 08/45] KVM: x86/mmu: Propagate mirror SPTE removal
 to S-EPT in handle_changed_spte()

On Thu, Feb 05, 2026, Yan Zhao wrote:
> On Wed, Feb 04, 2026 at 06:23:38PM -0800, Sean Christopherson wrote:
> > On Wed, Feb 04, 2026, Yan Zhao wrote:
> > > On Wed, Jan 28, 2026 at 05:14:40PM -0800, Sean Christopherson wrote:
> > > > @@ -590,10 +566,21 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > > >  	 * the paging structure.  Note the WARN on the PFN changing without the
> > > >  	 * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
> > > >  	 * pages are kernel allocations and should never be migrated.
> > > > +	 *
> > > > +	 * When removing leaf entries from a mirror, immediately propagate the
> > > > +	 * changes to the external page tables.  Note, non-leaf mirror entries
> > > > +	 * are handled by handle_removed_pt(), as TDX requires that all leaf
> > > > +	 * entries are removed before the owning page table.  Note #2, writes
> > > > +	 * to make mirror PTEs shadow-present are propagated to external page
> > > > +	 * tables by __tdp_mmu_set_spte_atomic(), as KVM needs to ensure the
> > > > +	 * external page table was successfully updated before marking the
> > > > +	 * mirror SPTE present.
> > > >  	 */
> > > >  	if (was_present && !was_leaf &&
> > > >  	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> > > >  		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> > > > +	else if (was_leaf && is_mirror_sptep(sptep) && !is_leaf)
> > > Should we check !is_present instead of !is_leaf?
> > > e.g. a transition from a present leaf entry to a present non-leaf entry could
> > > also trigger this if case.
> > 
> > No, the !is_leaf check is very intentional.  At this point in the series, S-EPT
> > doesn't support hugepages.  If KVM manages to install a leaf SPTE and replaces
> > that SPTE with a non-leaf SPTE, then we absolutely want the KVM_BUG_ON() in
> > tdx_sept_remove_private_spte() to fire:
> > 
> > 	/* TODO: handle large pages. */
> > 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > 		return -EIO;
> But the op is named remove_external_spte().
> And the check of "level != PG_LEVEL_4K" is for removing large leaf entries.

I agree that the naming at this point in the series is unfortunate, but I don't
see it as outright wrong.  That the TDP MMU could theoretically replace the leaf
SPTE with a non-leaf SPTE doesn't change the fact that the old leaf SPTE *is*
being removed.

> Relying on this check is tricky and confusing.

If it's still confusing at the end of the series, then I'm happy to discuss how
we can make it less confusion.  But as of this point in the series, I unfortunately
don't see a better way to achieve my end goals (reducing the number of kvm_x86_ops
hooks, and reducing how many TDX specific details bleed into common MMU code).

There are "different" ways to incrementally move from where were at today, to where
I want KVM to be, but I don't see them as "better".  I.e. AFAICT, there's no way
to move incrementally with reviewable patches while also maintaining perfect/ideal
naming and flow.

> > And then later on, when S-EPT gains support for hugepages, "KVM: TDX: Add core
> > support for splitting/demoting 2MiB S-EPT to 4KiB" doesn't need to touch code
> > outside of arch/x86/kvm/vmx/tdx.c, because everything has already been plumbed
> > in.
> I haven't looked at the later patches for huge pages,

Please do.  As above, I don't think it's realistic to completely avoid some amount
of "eww" in the intermediate stages.

> but plumbing here directly for splitting does not look right when it's
> invoked under shared mmu_lock.
> See the comment below.
>  
> > > Besides, need "KVM_BUG_ON(shared, kvm)" in this case.
> > 
> > Eh, we have lockdep_assert_held_write() in the S-EPT paths that require mmu_lock
> > to be held for write.  I don't think a KVM_BUG_ON() here would add meaningful
> > value.
> Hmm, I think KVM_BUG_ON(shared, kvm) is still useful.
> If KVM invokes remove_external_spte() under shared mmu_lock, it needs to freeze
> the entry first, similar to the sequence in __tdp_mmu_set_spte_atomic().
> 
> i.e., invoking external x86 ops in handle_changed_spte() for mirror roots should
> be !shared only.

Sure, but...

> Relying on the TDX code's lockdep_assert_held_write() for warning seems less
> clear than having an explicit check here.

...that's TDX's responsibility to enforce, and I don't see any justification for
something more than a lockdep assertion.  As I've said elsewhere, several times,
at some point we have to commit to getting the code right.  Adding KVM_BUG_ON() in
Every. Single. Call. does not yield more maintainable code.  There are myriad
things KVM can screw up, many of which have far, far more harmful impact than
calling an S-EPT hook with mmu_lock held for read instead of write.

The bar for adding a KVM_BUG_ON() is not simply "this shouldn't happen".  It's,
this shouldn't happen *and* at least one of (not a complete list):

  - we've either screwed this up badly more than once
  - it's really hard to get right
  - we might not notice if we do screw it up
  - KVM might corrupt data if we continue on

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ