[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADrL8HW=eDMWadouN2x5DzDzg6yEX0sbFjacZRnRDUGzRDPJ2Q@mail.gmail.com>
Date: Mon, 27 Jan 2025 11:52:28 -0800
From: James Houghton <jthoughton@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, David Matlack <dmatlack@...gle.com>,
David Rientjes <rientjes@...gle.com>, Marc Zyngier <maz@...nel.org>,
Oliver Upton <oliver.upton@...ux.dev>, Wei Xu <weixugc@...gle.com>, Yu Zhao <yuzhao@...gle.com>,
Axel Rasmussen <axelrasmussen@...gle.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn
and kvm_age_gfn
On Fri, Jan 10, 2025 at 2:47 PM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Tue, Nov 05, 2024, James Houghton wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index a24fca3f9e7f..f26d0b60d2dd 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> > }
> >
> > /*
> > - * SPTEs must be modified atomically if they are shadow-present, leaf
> > - * SPTEs, and have volatile bits, i.e. has bits that can be set outside
> > - * of mmu_lock. The Writable bit can be set by KVM's fast page fault
> > - * handler, and Accessed and Dirty bits can be set by the CPU.
> > + * SPTEs must be modified atomically if they have bits that can be set outside
> > + * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
> > + * Writable bit can be set by KVM's fast page fault handler, the Accessed and
> > + * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
> > + * cleared by age_gfn_range().
> > *
> > * Note, non-leaf SPTEs do have Accessed bits and those bits are
> > * technically volatile, but KVM doesn't consume the Accessed bit of
> > @@ -53,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
> > static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
> > {
> > return is_shadow_present_pte(old_spte) &&
> > - is_last_spte(old_spte, level) &&
> > - spte_has_volatile_bits(old_spte);
> > + is_last_spte(old_spte, level);
>
> I don't like this change on multiple fronts. First and foremost, it results in
> spte_has_volatile_bits() being wrong for the TDP MMU. Second, the same logic
> applies to the shadow MMU; the rmap lock prevents a use-after-free of the page
> that owns the SPTE, but the zapping of the SPTE happens before the writer grabs
> the rmap lock.
Thanks Sean, yes I forgot about the shadow MMU case.
> Lastly, I'm very, very tempted to say we should omit Accessed state from
> spte_has_volatile_bits() and rename it to something like spte_needs_atomic_write().
> KVM x86 no longer flushes TLBs on aging, so we're already committed to incorrectly
> thinking a page is old in rare cases, for the sake of performance. The odds of
> KVM clobbering the Accessed bit are probably smaller than the odds of missing an
> Accessed update due to a stale TLB entry.
>
> Note, only the shadow_accessed_mask check can be removed. KVM needs to ensure
> access-tracked SPTEs are zapped properly, and dirty logging can't have false
> negatives.
I've dropped the change to kvm_tdp_mmu_spte_need_atomic_write() and
instead applied this diff.
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte)
return true;
if (spte_ad_enabled(spte)) {
- if (!(spte & shadow_accessed_mask) ||
- (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
+ /*
+ * Do not check the Accessed bit. It can be set (by the CPU)
+ * and cleared (by kvm_tdp_mmu_age_spte()) without holding
+ * the mmu_lock, but when clearing the Accessed bit, we do
+ * not invalidate the TLB, so we can already miss Accessed bit
+ * updates.
+ */
+ if (is_writable_pte(spte) && !(spte & shadow_dirty_mask))
return true;
}
I've also included a new patch to rename spte_has_volatile_bits() to
spte_needs_atomic_write() like you suggested. I merely renamed it in
all locations, including documentation; I haven't reworded the
documentation's use of the word "volatile."
>
> > }
> >
> > static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4508d868f1cd..f5b4f1060fff 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > ((_only_valid) && (_root)->role.invalid))) { \
> > } else
> >
> > +/*
> > + * Iterate over all TDP MMU roots in an RCU read-side critical section.
>
> Heh, that's pretty darn obvious. It would be far more helpful if the comment
> explained the usage rules, e.g. what is safe (at a high level).
How's this?
+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ * It is safe to iterate over the SPTEs under the root, but their values will
+ * be unstable, so all writes must be atomic. As this routine is meant to be
+ * used without holding the mmu_lock at all, any bits that are flipped must
+ * be reflected in kvm_tdp_mmu_spte_need_atomic_write().
+ */
> > + */
> > +#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id) \
> > + list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link) \
> > + if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
> > + (_root)->role.invalid) { \
> > + } else
> > +
> > #define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
> > __for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
> >
> > @@ -1168,16 +1177,16 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
> > u64 new_spte;
> >
> > if (spte_ad_enabled(iter->old_spte)) {
> > - iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> > - iter->old_spte,
> > - shadow_accessed_mask,
> > - iter->level);
> > + iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> > + shadow_accessed_mask);
>
> Align, and let this poke past 80:
>
> iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> shadow_accessed_mask);
Done.
> > new_spte = iter->old_spte & ~shadow_accessed_mask;
> > } else {
> > new_spte = mark_spte_for_access_track(iter->old_spte);
> > - iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
> > - iter->old_spte, new_spte,
> > - iter->level);
> > + /*
> > + * It is safe for the following cmpxchg to fail. Leave the
> > + * Accessed bit set, as the spte is most likely young anyway.
> > + */
> > + (void)__tdp_mmu_set_spte_atomic(iter, new_spte);
>
> Just a reminder that this needs to be:
>
> if (__tdp_mmu_set_spte_atomic(iter, new_spte))
> return;
>
Already applied, thanks!
> > }
> >
> > trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
> > --
> > 2.47.0.199.ga7371fff76-goog
> >
Powered by blists - more mailing lists