linux-kernel - Re: [PATCH] KVM: x86/tdp_mmu: Trigger the callback only when an interesting change

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZvUS+Cwg6DyA62EC@yzhao56-desk.sh.intel.com>
Date: Thu, 26 Sep 2024 15:53:28 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Isaku Yamahata <isaku.yamahata@...el.com>, <kvm@...r.kernel.org>,
	<sagis@...gle.com>, <chao.gao@...el.com>, <pbonzini@...hat.com>,
	<rick.p.edgecombe@...el.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] KVM: x86/tdp_mmu: Trigger the callback only when an
 interesting change

On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > KVM MMU behavior
> > ================
> > The leaf SPTE state machine is coded in make_spte().  Consider AD bits and
> > the present bits for simplicity.  The two functionalities and AD bits
> > support are related in this context.  unsync (manipulate D bit and W bit,
> > and handle write protect fault) and access tracking (manipulate A bit and
> > present bit, and hand handle page fault).  (We don't consider dirty page
> > tracking for now as it's future work of TDX KVM.)
> > 
> > * If AD bit is enabled:
> > D bit state change for dirty page tracking
> > On the first EPT violation without prefetch,
> > - D bits are set.
> > - Make SPTE writable as TDX supports only RXW (or if write fault).
> >   (TDX KVM doesn't support write protection at this state. It's future work.)
> > 
> > On the second EPT violation.
> > - clear D bits if write fault.
> 
> Heh, I was literally just writing changelogs for patches to address this (I told
> Sagi I would do it "this week"; that was four weeks ago).
> 
> This is a bug in make_spte().  Replacing a W=1,D=1 SPTE with a W=1,D=0 SPTE is
> nonsensical.  And I'm pretty sure it's an outright but for the TDP MMU (see below).
> 
> Right now, the fixes for make_spte() are sitting toward the end of the massive
> kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> fairly confident that series can land in 6.13 (lots and lots of small patches).
> 
> ---
> Author:     Sean Christopherson <seanjc@...gle.com>
> AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> Commit:     Sean Christopherson <seanjc@...gle.com>
> CommitDate: Thu Sep 12 16:35:06 2024 -0700
> 
>     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
>     
>     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
>     leaf SPTE (with the same target pfn) and clears the Writable bit or the
>     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
>     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
>     if the existing SPTE is writable.
>     
>     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
>     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
>     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
>     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
>     the only SPTE that is dirty at the time of the next relevant clearing of
>     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
>     because it sees the D=0 SPTE, and thus will complete the clearing of the
>     dirty logs without performing a TLB flush.
But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
matter clear_dirty_gfn_range() finds a D bit or not.

kvm_mmu_slot_apply_flags
  |kvm_mmu_slot_leaf_clear_dirty
  |  |kvm_tdp_mmu_clear_dirty_slot
  |    |clear_dirty_gfn_range
  |kvm_flush_remote_tlbs_memslot
     
>     Opportunistically harden the TDP MMU against clearing the Writable bit as
>     well, both to prevent similar bugs for write-protection, but also so that
>     the logic can eventually be deduplicated into spte.c (mmu_spte_update() in
>     the shadow MMU has similar logic).
>     
>     Fix the bug in the TDP MMU's page fault handler even though make_spte() is
>     clearly doing odd things, e.g. it marks the page dirty in its slot but
>     doesn't set the Dirty bit.  Precisely because replacing a leaf SPTE with
>     another leaf SPTE is so rare, the cost of hardening KVM against such bugs
>     is negligible.  The make_spte() will be addressed in a future commit.
>     
>     Fixes: bb18842e2111 ("kvm: x86/mmu: Add TDP MMU PF handler")
>     Cc: David Matlack <dmatlack@...gle.com>
>     Cc: stable@...r.kernel.org
>     Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3b996c1fdaab..7c6d1c610f0e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1038,7 +1038,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>         else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
>                 return RET_PF_RETRY;
>         else if (is_shadow_present_pte(iter->old_spte) &&
> -                !is_last_spte(iter->old_spte, iter->level))
> +                (!is_last_spte(iter->old_spte, iter->level) ||
> +                 (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) ||
> +                 (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))))
Should we also check is_accessed_spte() as what's done in mmu_spte_update()?

It is possible for make_spte() to make the second spte !is_dirty_spte(), e.g.
the second one is caused by a KVM_PRE_FAULT_MEMORY ioctl.

>                 kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
>  
>         /*
> ---
> 
> >  arch/x86/kvm/mmu/spte.h    |  6 ++++++
> >  arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++++++++---
> >  2 files changed, 31 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index a72f0e3bde17..1726f8ec5a50 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -214,6 +214,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> >   */
> >  #define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> >  
> > +#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK		\
> > +	(shadow_acc_track_mask |			\
> > +	 (SHADOW_ACC_TRACK_SAVED_BITS_MASK <<		\
> > +	  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) |		\
> 
> Just make TDX require A/D bits, there's no reason to care about access tracking.
If KVM_PRE_FAULT_MEMORY is allowed for TDX and if
cpu_has_vmx_ept_ad_bits() is false in TDX's hardware (not sure if it's possible),
access tracking bit is possbile to be changed, as in below scenario:

1. KVM_PRE_FAULT_MEMORY ioctl calls kvm_arch_vcpu_pre_fault_memory() to map
   a GFN, and make_spte() will call mark_spte_for_access_track() to
   remove shadow_acc_track_mask (i.e. RWX bits) and move R+X left to
   SHADOW_ACC_TRACK_SAVED_BITS_SHIFT.
2. If a concurrent page fault occurs on the same GFN on another vCPU, then
   make_spte() in that vCPU will not see prefetch and the new_spte is
   with RWX bits and with no bits set in SHADOW_ACC_TRACK_SAVED_MASK.

> 
> > +	 shadow_dirty_mask | shadow_accessed_mask)
> > +
> >  /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> >  static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
> >  
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 019b43723d90..cfb82ede8982 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -503,8 +503,6 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> >  	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> >  	int ret = 0;
> >  
> > -	KVM_BUG_ON(was_present, kvm);
> > -
> >  	lockdep_assert_held(&kvm->mmu_lock);
> >  	/*
> >  	 * We need to lock out other updates to the SPTE until the external
> > @@ -519,10 +517,34 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> >  	 * external page table, or leaf.
> >  	 */
> >  	if (is_leaf) {
> > -		ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
> > +		/*
> > +		 * SPTE is state machine with software available bits used.
> > +		 * Check if the change is interesting to the backend.
> > +		 */
> > +		if (!was_present)
> > +			ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
> > +		else {
> > +			/*
> > +			 * The external PTEs don't need updates for some bits,
> > +			 * but if others are changed, bug the VM.
> > +			 */
> > +			if (KVM_BUG_ON(~EXTERNAL_SPTE_IGNORE_CHANGE_MASK &
> 
> There's no reason to bug the VM.  WARN, yes (maybe), but not bug the VM.  And I
> think this should be short-circuited somewhere further up the chain, i.e. not
> just for TDX.
> 
> One idea would be to WARN and skip setting the SPTE in tdp_mmu_map_handle_target_level().
> I.e. WARN and ignore 1=>0 transitions for Writable and Dirty bits, and then drop
> the TLB flush (assuming the above patch lands).
> 
> > +				       (old_spte ^ new_spte), kvm)) {
> 
> Curly braces are unnecessary.
> 
> > +				ret = -EIO;
> > +			}
> > +		}
> > +
> > +		/*
> > +		 * The backend shouldn't return an error except EAGAIN.
> > +		 * It's hard to debug without those info.
> > +		 */
> > +		if (ret && ret != EAGAIN)
> > +			pr_debug("gfn 0x%llx old_spte 0x%llx new_spte 0x%llx level %d\n",
> > +				 gfn, old_spte, new_spte, level);
> 
> Please no.  Not in upstream.  Yeah, for development it's fine, but sprinkling
> printks all over the MMU eventually just results in stale printks, e.g. see all
> of the pgprintk crud that got ripped out a while back.
> 
> >  	} else {
> >  		void *external_spt = get_external_spt(gfn, new_spte, level);
> >  
> > +		KVM_BUG_ON(was_present, kvm);
> >  		KVM_BUG_ON(!external_spt, kvm);
> >  		ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
> >  	}
> > 
> > base-commit: d2c7662a6ea1c325a9ae878b3f1a265264bcd18b
> > -- 
> > 2.45.2
> > 
>