Message-ID: <Z4GjbCRgyBe7k9gw@google.com>
Date: Fri, 10 Jan 2025 14:47:08 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: James Houghton <jthoughton@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, David Matlack <dmatlack@...gle.com>, 
	David Rientjes <rientjes@...gle.com>, Marc Zyngier <maz@...nel.org>, 
	Oliver Upton <oliver.upton@...ux.dev>, Wei Xu <weixugc@...gle.com>, Yu Zhao <yuzhao@...gle.com>, 
	Axel Rasmussen <axelrasmussen@...gle.com>, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn
 and kvm_age_gfn

On Tue, Nov 05, 2024, James Houghton wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index a24fca3f9e7f..f26d0b60d2dd 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  }
>  
>  /*
> - * SPTEs must be modified atomically if they are shadow-present, leaf
> - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> - * of mmu_lock.  The Writable bit can be set by KVM's fast page fault
> - * handler, and Accessed and Dirty bits can be set by the CPU.
> + * SPTEs must be modified atomically if they have bits that can be set outside
> + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> + * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
> + * cleared by age_gfn_range().
>   *
>   * Note, non-leaf SPTEs do have Accessed bits and those bits are
>   * technically volatile, but KVM doesn't consume the Accessed bit of
> @@ -53,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
>  static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
>  {
>  	return is_shadow_present_pte(old_spte) &&
> -	       is_last_spte(old_spte, level) &&
> -	       spte_has_volatile_bits(old_spte);
> +	       is_last_spte(old_spte, level);

I don't like this change on multiple fronts.  First and foremost, it results in
spte_has_volatile_bits() being wrong for the TDP MMU.  Second, the same logic
applies to the shadow MMU; the rmap lock prevents a use-after-free of the page
that owns the SPTE, but the zapping of the SPTE happens before the writer grabs
the rmap lock.

Lastly, I'm very, very tempted to say we should omit Accessed state from
spte_has_volatile_bits() and rename it to something like spte_needs_atomic_write().
KVM x86 no longer flushes TLBs on aging, so we're already committed to incorrectly
thinking a page is old in rare cases, for the sake of performance.  The odds of
KVM clobbering the Accessed bit are probably smaller than the odds of missing an
Accessed update due to a stale TLB entry.

Note, only the shadow_accessed_mask check can be removed.  KVM needs to ensure
access-tracked SPTEs are zapped properly, and dirty logging can't have false
negatives.
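E.g. a rough sketch of the direction I'm thinking of (name and exact body TBD,
this just keeps the existing fast-#PF and dirty-logging checks and drops the
Accessed check):

	bool spte_needs_atomic_write(u64 spte)
	{
		/* The Writable bit can be set outside of mmu_lock by the fast #PF handler. */
		if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
			return true;

		/*
		 * Access-tracked SPTEs can be restored by the fast #PF handler,
		 * and KVM needs to ensure they're zapped properly.
		 */
		if (is_access_track_spte(spte))
			return true;

		/*
		 * Hardware-set Dirty bits must not be clobbered, i.e. dirty
		 * logging can't have false negatives.  Don't bother with the
		 * Accessed bit; losing an Accessed update is no worse than the
		 * stale TLB case KVM already accepts now that aging doesn't
		 * flush.
		 */
		return spte_ad_enabled(spte) && is_writable_pte(spte) &&
		       !(spte & shadow_dirty_mask);
	}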

>  }
>  
>  static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4508d868f1cd..f5b4f1060fff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>  		     ((_only_valid) && (_root)->role.invalid))) {		\
>  		} else
>  
> +/*
> + * Iterate over all TDP MMU roots in an RCU read-side critical section.

Heh, that's pretty darn obvious.  It would be far more helpful if the comment
explained the usage rules, e.g. what is safe (at a high level).
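E.g. something along these lines (off-the-cuff wording, but you get the idea):

	/*
	 * Iterate over valid TDP MMU roots while holding only rcu_read_lock().
	 * Roots are freed only after an RCU grace period, so the root itself is
	 * guaranteed to be stable, but it may be invalidated and its SPTEs may
	 * be zapped/modified at any time, i.e. SPTEs must be read and written
	 * atomically, and the caller must tolerate stale values.
	 */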

> + */
> +#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id)			\
> +	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)		\
> +		if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||	\
> +		    (_root)->role.invalid) {					\
> +		} else
> +
>  #define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
>  	__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
>  
> @@ -1168,16 +1177,16 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
>  	u64 new_spte;
>  
>  	if (spte_ad_enabled(iter->old_spte)) {
> -		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> -							 iter->old_spte,
> -							 shadow_accessed_mask,
> -							 iter->level);
> +		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> +						shadow_accessed_mask);

Align, and let this poke past 80:

		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
								shadow_accessed_mask);

>  		new_spte = iter->old_spte & ~shadow_accessed_mask;
>  	} else {
>  		new_spte = mark_spte_for_access_track(iter->old_spte);
> -		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
> -							iter->old_spte, new_spte,
> -							iter->level);
> +		/*
> +		 * It is safe for the following cmpxchg to fail. Leave the
> +		 * Accessed bit set, as the spte is most likely young anyway.
> +		 */
> +		(void)__tdp_mmu_set_spte_atomic(iter, new_spte);

Just a reminder that this needs to be:

		if (__tdp_mmu_set_spte_atomic(iter, new_spte))
			return;

>  	}
>  
>  	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
> -- 
> 2.47.0.199.ga7371fff76-goog
> 
