linux-kernel - Re: [PATCH v5 4/9] mm: Add test_clear_young_fast

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADrL8HW3rZ5xgbyGa+FXk50QQzF4B1=sYL8zhBepj6tg0EiHYA@mail.gmail.com>
Date: Mon, 17 Jun 2024 09:50:26 -0700
From: James Houghton <jthoughton@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Yu Zhao <yuzhao@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Paolo Bonzini <pbonzini@...hat.com>, Ankit Agrawal <ankita@...dia.com>, 
	Axel Rasmussen <axelrasmussen@...gle.com>, Catalin Marinas <catalin.marinas@....com>, 
	David Matlack <dmatlack@...gle.com>, David Rientjes <rientjes@...gle.com>, 
	James Morse <james.morse@....com>, Jonathan Corbet <corbet@....net>, Marc Zyngier <maz@...nel.org>, 
	Oliver Upton <oliver.upton@...ux.dev>, Raghavendra Rao Ananta <rananta@...gle.com>, 
	Ryan Roberts <ryan.roberts@....com>, Shaoqin Huang <shahuang@...hat.com>, 
	Suzuki K Poulose <suzuki.poulose@....com>, Wei Xu <weixugc@...gle.com>, 
	Will Deacon <will@...nel.org>, Zenghui Yu <yuzenghui@...wei.com>, kvmarm@...ts.linux.dev, 
	kvm@...r.kernel.org, linux-arm-kernel@...ts.infradead.org, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Fri, Jun 14, 2024, James Houghton wrote:
> > On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson <seanjc@...gle.com> wrote:
> > >
> > > On Thu, Jun 13, 2024, James Houghton wrote:
> > > > I wonder if this still makes sense if whether or not an MMU is "fast"
> > > > is determined by how contended some lock(s) are at the time.
> > >
> > > No.  Just because a lock wasn't contended on the initial aging doesn't mean it
> > > won't be contended on the next round.  E.g. when using KVM x86's shadow MMU, which
> > > takes mmu_lock for write for all operations, an aging operation could get lucky
> > > and sneak in while mmu_lock happened to be free, but then get stuck behind a large
> > > queue of operations.
> > >
> > > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
> > > an MMU that takes mmu_lock for read in all but the most rare paths.
> >
> > Aging and look-around themselves only use the fast-only notifiers, so
> > they won't ever wait on a lock (well... provided KVM is written like
> > that, which I think is a given).
>
> Regarding aging, is that actually the behavior that we want?  I thought the plan
> is to have the initial test look at all MMUs, i.e. be potentially slow, but only
> do the lookaround if it can be fast.  IIUC, that was Yu's intent (and peeking back
> at v2, that is indeed the case, unless I'm misreading the code).

I believe what I said is correct. There are three separate things going on here:

1. Aging (when we hit the low watermark, scan PTEs to find young pages)
2. Eviction (pick a page to evict; if it is definitely not young, evict it)
3. Look-around (upon finding a page is young upon attempted eviction,
check adjacent pages if they are young too)

(1) and (3) both use the fast-only notifier, (2) does not. In v2[1],
this is true, as in the (1) and (3) paths, the notifier being called
is the test_clear_young() notifier with the bitmap present. If the
bitmap is present, the shadow MMU is not consulted.

For (2), there is an mmu_notifier_clear_young() with no bitmap called
in should_look_around(), just like in this v5. (2) is the only one
that needs to use the slow notifier; the (1) and (3) are best-effort
attempts to improve the decision for which pages to evict.

[1]: https://lore.kernel.org/linux-mm/20230526234435.662652-11-yuzhao@google.com/

>
> If KVM _never_ consults shadow (nested TDP) MMUs, then a VM running an L2 will
> end up with hot pages (used by L2) swapped out.

The shadow MMU is consulted at eviction time -- only at eviction time.
So pages used by L2 won't be swapped out unless they're still cold at
eviction time.

In my (and Yu's) head, not being able to do aging for nested TDP is ok
because running nested VMs is much more rare than running non-nested
VMs. And in the non-nested case, being able to do aging is a strict
improvement over what we have now.

We could look into being able to do aging with the shadow MMU, but I
don't think that should necessarily block this series.

> Or are you saying that the "test" could be slow, but not "clear"?  That's also
> suboptimal, because any pages accessed by L2 will always appear hot.

No. I hope what I said above makes sense.

> > should_look_around() will use the slow notifier because it (despite its name)
> > is responsible for accurately determining if a page is young lest we evict a
> > young page.
> >
> > So in this case where "fast" means "lock not contended for now",
>
> No.  In KVM, "fast" needs to be a property of the MMU, not a reflection of system
> state at some random snapshot in time.

Ok, that's fine with me.

> > I don't think it's necessarily wrong for MGLRU to attempt to find young
> > pages, even if sometimes it will bail out because a lock is contended/held
>
> lru_gen_look_around() skips lookaround if something else is waiting on the page
> table lock, but that is a far cry from what KVM would be doing.  (a) the PTL is
> already held, and (b) it is scoped precisely to the range being processed.  Not
> looking around makes sense because there's a high probability of the PTEs in
> question being modified by a different task, i.e. of the look around being a
> waste of time.
>
> In KVM, mmu_lock is not yet held, so KVM would need to use try-lock to avoid
> waiting, and would need to bail from the middle of the aging walk if a different
> task contends mmu_lock.
>
> I agree that's not "wrong", but in part because mmu_lock is scoped to the entire
> VM, it risks ending up with semi-random, hard to debug behavior.  E.g. a user
> could see intermittent "failures" that come and go based on seemingly unrelated
> behavior in KVM.  And implementing the "bail" behavior in the shadow MMU would
> require non-trivial changes.

This is a great point, thanks Sean. I won't send any patches that make
other architectures trylock() for the fast notifier.

>
> In other words, I would very strongly prefer that the shadow MMU be all or nothing,
> i.e. is either part of look-around or isn't.  And if nested TDP doesn't fair well
> with MGLRU, then we (or whoever cares) can spend the time+effort to make it work
> with fast-aging.
>
> Ooh!  Actually, after fiddling a bit to see how feasible fast-aging in the shadow
> MMU would be, I'm pretty sure we can do straight there for nested TDP.  Or rather,
> I suspect/hope we can get close enough for an initial merge, which would allow
> aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things
> because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification.
>
> Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done
> in such a way that a lockless walk would be painfully complex.  But if there is
> exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at
> that one SPTE.  And with nested TDP, unless L1 is doing something uncommon, e.g.
> mapping the same page into multiple L2s, that overwhelming vast majority of rmaps
> have only one entry.  That's not the case for legacy shadow paging because kernels
> almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map
> along with any userspace mappings.
>
> E.g. with a QEMU+KVM setup running Linux as L2, in my setup, only one gfn has
> multiple rmap entries (IIRC, it's from QEMU remapping BIOS into low memory during
> boot).
>
> So, if we bifurcate aging behavior based on whether or not the TDP MMU is enabled,
> then whether or not aging is fast is constant (after KVM loads).  Rougly, the KVM
> side of things would be the below, plus a bunch of conversions to WRITE_ONCE() to
> ensure a stable rmap value (KVM already plays nice with lockless accesses to SPTEs,
> thanks to the fast page fault path).
>
> If KVM adds an rmap entry after the READ_ONCE(), then functionally all is still
> well because the original SPTE pointer is still valid.  If the rmap entry is
> removed, then KVM just needs to ensure the owning page table isn't freed.  That
> could be done either with a cmpxchg() (KVM zaps leafs SPTE before freeing page
> tables, and the rmap stuff doesn't actually walk upper level entries), or by
> enhancing the shadow MMU's lockless walk logic to allow lockless walks from
> non-vCPU tasks.
>
> And in the (hopefully) unlikely scenario someone has a use case where L1 maps a
> gfn into multiple L2s (or aliases in bizarre ways), then we can tackle making the
> nested TDP shadow MMU rmap walks always lockless.
>
> E.g. again very roughly, if we went with the latter:
>
> @@ -1629,22 +1629,45 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
>         __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
>  }
>
> +static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm,
> +                                                         struct kvm_gfn_range *range,
> +                                                         typedefme handler)
> +{
> +       gfn_t gfn;
> +       int level;
> +       u64 *spte;
> +       bool ret;
> +
> +       walk_shadow_page_lockless_begin(???);
> +
> +       for (gfn = range->start; gfn < range->end; gfn++) {
> +               for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> +                       spte = (void *)READ_ONCE(gfn_to_rmap(gfn, level, range->slot)->val);
> +
> +                       /* Skip the gfn if there are multiple SPTEs. */
> +                       if ((unsigned long)spte & 1)
> +                               continue;
> +
> +                       ret |= handler(spte);
> +               }
> +       }
> +
> +       walk_shadow_page_lockless_end(???);
> +}
> +
>  static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
>                          bool fast_only)
>  {
>         bool young = false;
>
> -       if (kvm_memslots_have_rmaps(kvm)) {
> -               if (fast_only)
> -                       return -1;
> -
> -               write_lock(&kvm->mmu_lock);
> -               young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> -               write_unlock(&kvm->mmu_lock);
> -       }
> -
> -       if (tdp_mmu_enabled)
> +       if (tdp_mmu_enabled) {
>                 young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> +               young |= kvm_handle_gfn_range_lockless(kvm, range, kvm_age_rmap_fast);
> +       } else if (!fast_only) {
> +               write_lock(&kvm->mmu_lock);
> +               young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> +               write_unlock(&kvm->mmu_lock);
> +       }
>
>         return (int)young;
>  }

Hmm, interesting. I need to spend a little bit more time digesting this.

Would you like to see this included in v6? (It'd be nice to avoid the
WAS_FAST stuff....) Should we leave it for a later series? I haven't
formed my own opinion yet.

> > for a few or even a majority of the pages. Not doing look-around is the same
> > as doing look-around and finding that no pages are young.
>
> No, because the former is deterministic and predictable, the latter is not.

Fair enough. I just meant in terms of the end result.

> > Anyway, I don't think this bit is really all that important unless we
> > can demonstrate that KVM participating like this actually results in a
> > measurable win.
>
> Participating like what?  You've lost me a bit.  Are we talking past each other?

Participating in aging and look-around by trylock()ing instead of
being lockless.

All I meant was: it's not important to try to figure out how to get
this non-deterministic trylock()-based "fast" notifier implementation
to work, because I'm not actually suggesting we actually do this.
There was some suggestion previously to make arm64 work like this (in
lieu of a lockless implementation), but I'd need to actually verify
that this performs/behaves well before actually sending a patch.

> What I am saying is that we do this (note that this is slightly different than
> an earlier sketch; I botched the ordering of spin_is_contend() in that one, and
> didn't account for the page being young in the primary MMU).
>
>         if (pte_young(ptep_get(pte)))
>                 young = 1 | MMU_NOTIFY_WAS_FAST;
>
>         young |= ptep_clear_young_notify(vma, addr, pte);
>         if (!young)
>                 return false;
>
>         if (!(young & MMU_NOTIFY_WAS_FAST))
>                 return true;
>
>         if (spin_is_contended(pvmw->ptl))
>                 return false;
>
>         /* exclude special VMAs containing anon pages from COW */
>         if (vma->vm_flags & VM_SPECIAL)
>                 return false;

SGTM. I think we're on the same page here.