Date: Fri, 14 Jun 2024 09:12:59 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: James Houghton <jthoughton@...gle.com>
Cc: Yu Zhao <yuzhao@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Paolo Bonzini <pbonzini@...hat.com>, Ankit Agrawal <ankita@...dia.com>, 
	Axel Rasmussen <axelrasmussen@...gle.com>, Catalin Marinas <catalin.marinas@....com>, 
	David Matlack <dmatlack@...gle.com>, David Rientjes <rientjes@...gle.com>, 
	James Morse <james.morse@....com>, Jonathan Corbet <corbet@....net>, Marc Zyngier <maz@...nel.org>, 
	Oliver Upton <oliver.upton@...ux.dev>, Raghavendra Rao Ananta <rananta@...gle.com>, 
	Ryan Roberts <ryan.roberts@....com>, Shaoqin Huang <shahuang@...hat.com>, 
	Suzuki K Poulose <suzuki.poulose@....com>, Wei Xu <weixugc@...gle.com>, 
	Will Deacon <will@...nel.org>, Zenghui Yu <yuzenghui@...wei.com>, kvmarm@...ts.linux.dev, 
	kvm@...r.kernel.org, linux-arm-kernel@...ts.infradead.org, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Thu, Jun 13, 2024, James Houghton wrote:
> On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <seanjc@...gle.com> wrote:
> > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
> > i.e. would be a minor optimization when KVM doesn't support fast aging.  But that's
> > probably a pretty unlikely combination, so it's probably not a valid argument.
> >
> > So, I guess I don't have a strong opinion?
> 
> (Sorry for the somewhat delayed response... spent some time actually
> writing what this would look like.)
> 
> I see what you mean, thanks! So has_fast_aging might be set by KVM if
> the architecture sets a Kconfig saying that it understands the concept
> of fast aging, basically what the presence of this v5's
> test_clear_young_fast_only() indicates.

It would need to be a runtime setting, because KVM x86-64 with tdp_mmu_enabled=false
doesn't support fast aging (uses the shadow MMU even for TDP).
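As a rough sketch of what I mean (has_fast_aging is an invented field name
here, not an existing mmu_notifier member):

	/*
	 * Sketch only: report fast-aging support when the notifier is
	 * registered, i.e. at runtime, not via Kconfig.  On x86, fast
	 * aging is possible iff the TDP MMU is in use; with
	 * tdp_mmu_enabled=false the shadow MMU handles TDP too, and
	 * everything takes mmu_lock for write.
	 */
	kvm->mmu_notifier.has_fast_aging = tdp_mmu_enabled;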

> > I don't understand where the "must check shadow MMU" in #4 comes from.  I also
> > don't think it's necessary; see below.
> 
> I just meant `kvm_has_shadow_mmu_sptes()` or
> `kvm_memslots_have_rmaps()`. I like the logic you suggest below. :)
> 
> > > Some of this reordering (and maybe a change from
> > > kvm_shadow_root_allocated() to checking indirect_shadow_pages or
> > > something else) can be done in its own patch.
> 
> So just to be clear, for test_young(), I intend to have a patch in v6
> to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
> like a pure win; no reason not to include it if we're making logic
> changes here anyway.

I don't think that's correct.  The initial fast_only=false aging should process
shadow MMUs (nested TDP) and TDP MMUs, otherwise a future fast_only=false would
get a false positive on young due to failing to clear the Accessed bit in the
shadow MMU.  E.g. if page X is accessed by both L1 and L2, then aged, and never
accessed again, the Accessed bit would still be set in the page tables for L2.

My thought for MMU_NOTIFY_WAS_FAST below (which again is a bad name) is to
communicate to MGLRU that the page was found to be young in an MMU that supports
fast aging, i.e. that looking around at other SPTEs is worth doing.
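I.e. on the MGLRU side, consumption would look something like this
fragment (hand-wavy; whether lru_gen_look_around() is the right hook is
an assumption on my part):

	int young = mmu_notifier_clear_young(mm, start, end);

	/* Young in a fast MMU means scanning nearby SPTEs is cheap. */
	if (young & MMU_NOTIFY_WAS_FAST)
		lru_gen_look_around(pvmw);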

> > > > So rather than failing the fast aging, I think what we want is to know if an
> > > > mmu_notifier found a young SPTE during a fast lookup.  E.g. something like this
> > > > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> > > > is an optional optimization to avoid taking mmu_lock for write in paths where a
> > > > (very rare) false negative is acceptable.
> > > >
> > > >   static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> > > >   {
> > > >         return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> > > >   }
> > > >
> > > >   static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> > > >                          bool fast_only)
> > > >   {
> > > >         int young = 0;
> > > >
> > > >         if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> > > >                 write_lock(&kvm->mmu_lock);
> > > >                 young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > > >                 write_unlock(&kvm->mmu_lock);
> > > >         }
> > > >
> > > >         if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> > > >                 young = 1 | MMU_NOTIFY_WAS_FAST;
> 
> The most straightforward way (IMHO) to return something like `1 |
> MMU_NOTIFY_WAS_FAST` up to the MMU notifier itself is to make
> gfn_handler_t return int instead of bool.

Hrm, all the options are unpleasant.  Modifying gfn_handler_t to return an int
will require an absurd amount of churn (all implementations in all architectures),
and I don't love that the APIs that return true/false to indicate "flush" would
lose their boolean-ness.

One idea would be to add kvm_mmu_notifier_arg.aging_was_fast or so, and then
refactor kvm_handle_hva_range_no_flush() into a dedicated aging helper, and have
it morph the KVM-internal flag into an MMU_NOTIFIER flag.  It's not perfect either,
but it requires far less churn and keeps some of the KVM<=>mmu_notifier details in
common KVM code.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7b9d2633a931..c11a359b6ff5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
        unsigned long attributes;
+       bool aging_was_fast;
 };
 
 struct kvm_gfn_range {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 436ca41f61e5..a936f6bedd97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -685,10 +685,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
        return __kvm_handle_hva_range(kvm, &range).ret;
 }
 
-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
-                                                        unsigned long start,
-                                                        unsigned long end,
-                                                        gfn_handler_t handler)
+static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
+                                            unsigned long start,
+                                            unsigned long end,
+                                            gfn_handler_t handler,
+                                            bool flush_if_young)
 {
        struct kvm *kvm = mmu_notifier_to_kvm(mn);
        const struct kvm_mmu_notifier_range range = {
@@ -696,11 +696,14 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
                .end            = end,
                .handler        = handler,
                .on_lock        = (void *)kvm_null_fn,
-               .flush_on_ret   = false,
+               .flush_on_ret   = flush_if_young,
                .may_block      = false,
+               .arg.aging_was_fast = false,
        };
 
-       return __kvm_handle_hva_range(kvm, &range).ret;
+       bool young = __kvm_handle_hva_range(kvm, &range).ret;
+
+       return (int)young | (range.arg.aging_was_fast ? MMU_NOTIFIER_FAST_AGING : 0);
 }
 
 void kvm_mmu_invalidate_begin(struct kvm *kvm)
@@ -865,7 +868,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 {
        trace_kvm_age_hva(start, end);
 
-       return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
+       return kvm_age_hva_range(mn, start, end, kvm_age_gfn, true);
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -875,20 +878,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 {
        trace_kvm_age_hva(start, end);
 
-       /*
-        * Even though we do not flush TLB, this will still adversely
-        * affect performance on pre-Haswell Intel EPT, where there is
-        * no EPT Access Bit to clear so that we have to tear down EPT
-        * tables instead. If we find this unacceptable, we can always
-        * add a parameter to kvm_age_hva so that it effectively doesn't
-        * do anything on clear_young.
-        *
-        * Also note that currently we never issue secondary TLB flushes
-        * from clear_young, leaving this job up to the regular system
-        * cadence. If we find this inaccurate, we might come up with a
-        * more sophisticated heuristic later.
-        */
-       return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+       return kvm_age_hva_range(mn, start, end, kvm_age_gfn, false);
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -897,8 +887,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 {
        trace_kvm_test_age_hva(address);
 
-       return kvm_handle_hva_range_no_flush(mn, address, address + 1,
-                                            kvm_test_age_gfn);
+       return kvm_age_hva_range(mn, address, address + 1, kvm_test_age_gfn, false);
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,


> > The change, relative to v5, that I am proposing is that MGLRU looks around if
> > the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
> > only if _all_ secondary MMUs are fast.
> >
> > In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
> > fast_only flag.
> 
> Oh, yeah, that's a lot more intelligent than what I had. I think I
> fully understand your suggestion; I guess we'll see in v6. :)
> 
> I wonder if this still makes sense if whether or not an MMU is "fast"
> is determined by how contended some lock(s) are at the time.

No.  Just because a lock wasn't contended on the initial aging doesn't mean it
won't be contended on the next round.  E.g. when using KVM x86's shadow MMU, which
takes mmu_lock for write for all operations, an aging operation could get lucky
and sneak in while mmu_lock happened to be free, but then get stuck behind a large
queue of operations.

The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
an MMU that takes mmu_lock for read in all but the most rare paths.
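In terms of the __kvm_age_gfn() sketch above, that means fast_only should
only ever reach the lockless (or read-locked) TDP MMU walk, never the
write-locked shadow MMU path, e.g.:

	/* Sketch: fast_only must never wait on mmu_lock for write. */
	if (fast_only)
		return tdp_mmu_enabled &&
		       kvm_tdp_mmu_age_gfn_range(kvm, range);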
