Message-ID: <ZtiBJydMqwkTaoOM@google.com>
Date: Wed, 4 Sep 2024 08:48:39 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: James Houghton <jthoughton@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Oliver Upton <oliver.upton@...ux.dev>, Marc Zyngier <maz@...nel.org>, Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding
 mmu_lock when aging gfns

On Tue, Sep 03, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@...gle.com> wrote:
> >
> > DO NOT MERGE, yet...
> >
> > Cc: James Houghton <jthoughton@...gle.com>
> > Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 59 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 48e8608c2738..9df6b465de06 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -995,13 +995,11 @@ static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> >   * locking is the same, but the caller is disallowed from modifying the rmap,
> >   * and so the unlock flow is a nop if the rmap is/was empty.
> >   */
> > -__maybe_unused
> >  static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
> >  {
> >         return __kvm_rmap_lock(rmap_head);
> >  }
> >
> > -__maybe_unused
> >  static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
> >                                      unsigned long old_val)
> >  {
> > @@ -1743,8 +1741,53 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
> >         __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
> >  }
> >
> > -static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> > -                                  struct kvm_gfn_range *range, bool test_only)
> > +static bool kvm_rmap_age_gfn_range_lockless(struct kvm *kvm,
> > +                                           struct kvm_gfn_range *range,
> > +                                           bool test_only)
> > +{
> > +       struct kvm_rmap_head *rmap_head;
> > +       struct rmap_iterator iter;
> > +       unsigned long rmap_val;
> > +       bool young = false;
> > +       u64 *sptep;
> > +       gfn_t gfn;
> > +       int level;
> > +       u64 spte;
> > +
> > +       for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > +               for (gfn = range->start; gfn < range->end;
> > +                    gfn += KVM_PAGES_PER_HPAGE(level)) {
> > +                       rmap_head = gfn_to_rmap(gfn, level, range->slot);
> > +                       rmap_val = kvm_rmap_lock_readonly(rmap_head);
> > +
> > +                       for_each_rmap_spte_lockless(rmap_head, &iter, sptep, spte) {
> > +                               if (!is_accessed_spte(spte))
> > +                                       continue;
> > +
> > +                               if (test_only) {
> > +                                       kvm_rmap_unlock_readonly(rmap_head, rmap_val);
> > +                                       return true;
> > +                               }
> > +
> > +                               /*
> > +                                * Marking SPTEs for access tracking outside of
> > +                                * mmu_lock is unsupported.  Report the page as
> > +                                * young, but otherwise leave it as-is.
> 
> Just for my own understanding, what's the main reason why it's unsafe

Note, I specifically said "unsupported", not "unsafe" :-D

> to mark PTEs for access tracking outside the mmu_lock?

It probably can be done safely?  The main issue is that marking the SPTE for
access tracking can also clear the Writable bit, and so we'd need to audit all
the flows that consume is_writable_pte().  Hmm, actually, that's less scary than
it first seems, because thanks to kvm_mmu_notifier_clear_young(), KVM already
clears the Writable bit in AD-disabled SPTEs without a TLB flush.  E.g.
mmu_spte_update() specifically looks at MMU-writable, not the Writable bit, when
deciding if a TLB flush is required.
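
Roughly, that part of the flush decision boils down to this (an illustrative
sketch of the rule, not the exact upstream code):

  /*
   * Flush only when KVM's MMU-writable tracking is removed; clearing
   * just the hardware Writable bit doesn't require a remote TLB flush.
   */
  static bool flush_on_spte_update(u64 old_spte, u64 new_spte)
  {
          return is_mmu_writable_spte(old_spte) &&
                 !is_mmu_writable_spte(new_spte);
  }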

On a related note, one thing I missed is that KVM would need to treat leaf SPTEs
as volatile at all times, as your MGLRU series modified
kvm_tdp_mmu_spte_need_atomic_write(), not the common spte_has_volatile_bits().

Actually, on second thought, maybe it isn't necessary for the AD-enabled case.
Effectively clobbering the Accessed bit is completely fine, as aging is tolerant
of false negatives and false positives, so long as they aren't excessive.  And
that's doubly true if KVM x86 follows MM and doesn't force a TLB flush[1].

Oooh, and triply true if KVM stops marking the folio accessed when zapping SPTEs[2].

So yeah, after thinking though all of the moving parts, maybe we should commit
to aging AD-disabled SPTEs out of mmu_lock.  AD-disabled leaf SPTEs do end up being
"special", because KVM needs to ensure it doesn't clobber the Writable bit, i.e.
AD-disabled leaf SPTEs need to be treated as volatile at all times.  But in practice,
forcing an atomic update for all AD-disabled leaf SPTEs probably doesn't actually
change much, because in most cases KVM is probably using an atomic access anyways,
e.g. because KVM is clearing the Writable bit and the Writable bit is already volatile.
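
In code, the rule would be something like this (hypothetical helper, not from
this series):

  /*
   * Sketch: an AD-disabled leaf SPTE must always be treated as volatile,
   * as lockless aging may clear its Writable bit (to mark the SPTE for
   * access tracking) at any time.
   */
  static bool spte_needs_atomic_update_for_aging(u64 spte, int level)
  {
          return is_last_spte(spte, level) && !spte_ad_enabled(spte);
  }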

FWIW, marking the folio dirty if the SPTE was writable, as is done today in
mmu_spte_age(), is sketchy if mmu_lock isn't held, but probably ok since this is
invoked from an mmu_notifier and presumably the caller holds a reference to the
page/folio.  But that's largely a moot point since I want to yank out that code
anyways[3].

[1] https://lore.kernel.org/all/ZsS_OmxwFzrqDcfY@google.com
[2] https://lore.kernel.org/all/20240726235234.228822-82-seanjc@google.com
[3] https://lore.kernel.org/all/20240726235234.228822-8-seanjc@google.com

> > +                               if (spte_ad_enabled(spte))
> > +                                       clear_bit((ffs(shadow_accessed_mask) - 1),
> > +                                                 (unsigned long *)sptep);
> 
> I feel like it'd be kinda nice to de-duplicate this clear_bit() piece
> with the one in kvm_rmap_age_gfn_range().

Ya, definitely no argument against adding a helper.
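
Something like this, e.g. (sketch only; the name is made up):

  static void kvm_clear_spte_accessed_bit(u64 *sptep)
  {
          clear_bit((ffs(shadow_accessed_mask) - 1),
                    (unsigned long *)sptep);
  }

so that kvm_rmap_age_gfn_range() and the lockless variant share one copy.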

> > +                               young = true;
> > +                       }
> > +
> > +                       kvm_rmap_unlock_readonly(rmap_head, rmap_val);
> > +               }
> > +       }
> > +       return young;
> > +}
> > +
> > +static bool __kvm_rmap_age_gfn_range(struct kvm *kvm,
> > +                                    struct kvm_gfn_range *range, bool test_only)
> >  {
> >         struct slot_rmap_walk_iterator iterator;
> >         struct rmap_iterator iter;
> > @@ -1783,6 +1826,18 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> >         return young;
> >  }
> >
> > +
> > +static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> > +                                  struct kvm_gfn_range *range, bool test_only)
> > +{
> > +       /* FIXME: This also needs to be guarded with something like range->fast_only. */
> > +       if (kvm_ad_enabled())
> 
> I expect this to be something like `if (kvm_ad_enabled() ||
> range->fast_only)`. With MGLRU, that means the pages will always be the last
> candidates for eviction, though it is still possible for them to be evicted
> (though I think this would basically never happen). I think this is fine.
> 
> I think the only other possible choice is to record/return 'not young'/false
> instead of 'young'/true if the spte is young but !spte_ad_enabled(). That
> doesn't seem to be obviously better, though we *will* get correct age
> information at eviction time, when !range->fast_only, at which point the page
> will not be evicted, and Accessed will be cleared.

As above, I think the simpler solution overall is to support aging AD-disabled
SPTEs out of mmu_lock.  The sequence of getting to that end state will be more
complex, but most of that complexity is going to happen irrespective of this series.
And it would mean KVM MGLRU support has no chance of landing in 6.12, but again
I think that's the reality either way.
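
For reference, the stopgap guard you're describing would look something like
this (sketch; range->fast_only comes from your series, not this patch):

  static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
                                     struct kvm_gfn_range *range, bool test_only)
  {
          /* Lockless aging handles AD-enabled SPTEs and "fast only" requests. */
          if (kvm_ad_enabled() || range->fast_only)
                  return kvm_rmap_age_gfn_range_lockless(kvm, range, test_only);

          return __kvm_rmap_age_gfn_range(kvm, range, test_only);
  }

though if aging AD-disabled SPTEs out of mmu_lock pans out, the kvm_ad_enabled()
check (and possibly the locked path) could go away entirely.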
