linux-kernel - Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADrL8HW44Hx_Ejx_6+FVKt1V17PdgT6rw+sNtKzumqc9UCVDfA@mail.gmail.com>
Date: Mon, 3 Jun 2024 15:45:59 -0700
From: James Houghton <jthoughton@...gle.com>
To: Yu Zhao <yuzhao@...gle.com>
Cc: Sean Christopherson <seanjc@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Paolo Bonzini <pbonzini@...hat.com>, Albert Ou <aou@...s.berkeley.edu>, 
	Ankit Agrawal <ankita@...dia.com>, Anup Patel <anup@...infault.org>, 
	Atish Patra <atishp@...shpatra.org>, Axel Rasmussen <axelrasmussen@...gle.com>, 
	Bibo Mao <maobibo@...ngson.cn>, Catalin Marinas <catalin.marinas@....com>, 
	David Matlack <dmatlack@...gle.com>, David Rientjes <rientjes@...gle.com>, 
	Huacai Chen <chenhuacai@...nel.org>, James Morse <james.morse@....com>, 
	Jonathan Corbet <corbet@....net>, Marc Zyngier <maz@...nel.org>, Michael Ellerman <mpe@...erman.id.au>, 
	Nicholas Piggin <npiggin@...il.com>, Oliver Upton <oliver.upton@...ux.dev>, 
	Palmer Dabbelt <palmer@...belt.com>, Paul Walmsley <paul.walmsley@...ive.com>, 
	Raghavendra Rao Ananta <rananta@...gle.com>, Ryan Roberts <ryan.roberts@....com>, 
	Shaoqin Huang <shahuang@...hat.com>, Shuah Khan <shuah@...nel.org>, 
	Suzuki K Poulose <suzuki.poulose@....com>, Tianrui Zhao <zhaotianrui@...ngson.cn>, 
	Will Deacon <will@...nel.org>, Zenghui Yu <yuzenghui@...wei.com>, kvm-riscv@...ts.infradead.org, 
	kvm@...r.kernel.org, kvmarm@...ts.linux.dev, 
	linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org, 
	linux-mips@...r.kernel.org, linux-mm@...ck.org, 
	linux-riscv@...ts.infradead.org, linuxppc-dev@...ts.ozlabs.org, 
	loongarch@...ts.linux.dev
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate
 in aging

On Thu, May 30, 2024 at 11:06 PM Yu Zhao <yuzhao@...gle.com> wrote:
>
> On Wed, May 29, 2024 at 7:08 PM James Houghton <jthoughton@...gle.com> wrote:
> >
> > Hi Yu, Sean,
> >
> > Perhaps I "simplified" this bit of the series a little bit too much.
> > Being able to opportunistically do aging with KVM (even without
> > setting the Kconfig) is valuable.
> >
> > IIUC, we have the following possibilities:
> > - v4: aging with KVM is done if the new Kconfig is set.
> > - v3: aging with KVM is always done.
>
> This is not true -- in v3, MGLRU only scans secondary MMUs if it can
> be done locklessly on x86. It uses a bitmap to imply this requirement.
>
> > - v2: aging with KVM is done when the architecture reports that it can
> > probably be done locklessly, set at KVM MMU init time.
>
> Not really -- it's only done if it can be done locklessly on both x86 and arm64.
>
> > - Another possibility?: aging with KVM is only done exactly when it
> > can be done locklessly (i.e., mmu_notifier_test/clear_young() called
> > such that it will not grab any locks).
>
> This is exactly the case for v2.

Thanks for clarifying; sorry for getting this wrong.

>
> > I like the v4 approach because:
> > 1. We can choose whether or not to do aging with KVM no matter what
> > architecture we're using (without requiring userspace be aware to
> > disable the feature at runtime with sysfs to avoid regressing
> > performance if they don't care about proactive reclaim).
> > 2. If we check the new feature bit (0x8) in sysfs, we can know for
> > sure if aging is meant to be working or not. The selftest changes I
> > made won't work properly unless there is a way to be sure that aging
> > is working with KVM.
>
> I'm not convinced, but it doesn't mean your point of view is invalid.
> If you fully understand the implications of your design choice and
> document them, I will not object.
>
> All optimizations in v2 were measured step by step. Even that bitmap,
> which might be considered overengineered, brought a readily
> measuarable 4% improvement in memcached throughput on Altra Max
> swapping to Optane:
>
> Using the bitmap (64 KVM PTEs for each call)
> ============================================================================================================================
> Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50
> Latency     p99 Latency   p99.9 Latency       KB/sec
> ----------------------------------------------------------------------------------------------------------------------------
> Sets            0.00          ---          ---             ---
>     ---             ---             ---         0.00
> Gets      1012801.92    431436.92     14965.11         0.06246
> 0.04700         0.16700         4.31900     39635.83
> Waits           0.00          ---          ---             ---
>     ---             ---             ---          ---
> Totals    1012801.92    431436.92     14965.11         0.06246
> 0.04700         0.16700         4.31900     39635.83
>
>
> Not using the bitmap (1 KVM PTEs for each call)
> ============================================================================================================================
> Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50
> Latency     p99 Latency   p99.9 Latency       KB/sec
> ----------------------------------------------------------------------------------------------------------------------------
> Sets            0.00          ---          ---             ---
>     ---             ---             ---         0.00
> Gets       968210.02    412443.85     14303.89         0.06517
> 0.04700         0.15900         7.42300     37890.74
> Waits           0.00          ---          ---             ---
>     ---             ---             ---          ---
> Totals     968210.02    412443.85     14303.89         0.06517
> 0.04700         0.15900         7.42300     37890.74
>
>
> FlameGraphs with bitmap (1.svg) and without bitmap (2.svg) attached.
>
> What I don't think is acceptable is simplifying those optimizations
> out without documenting your justifications (I would even call it a
> design change, rather than simplification, from v3 to v4).

I'll put back something similar to what you had before (like a
test_clear_young() with a "fast" parameter instead of "bitmap"). I
like the idea of having a new mmu notifier, like
fast_test_clear_young(), while leaving test_young() and clear_young()
unchanged (where "fast" means "prioritize speed over accuracy"). It
seems a little more straightforward that way.

>
> > For look-around at eviction time:
> > - v4: done if the main mm PTE was young and no MMU notifiers are subscribed.
> > - v2/v3: done if the main mm PTE was young or (the SPTE was young and
> > the MMU notifier was lockless/fast).
>
> The host and secondary MMUs are two *independent* cases, IMO:
> 1. lookaround the host MMU if the PTE mapping the folio under reclaim is young.
> 2. lookaround the secondary MMU if it can be done locklessly.
>
> So the v2/v3 behavior sounds a lot more reasonable to me.

I'll restore the v2/v3 behavior. I initially removed it because,
without batching, we (mostly) lose the spatial locality that, IIUC,
look-around is designed to exploit.

>
> Also a nit -- don't use 'else' in the following case (should_look_around()):
>
>   if (foo)
>     return bar;
>   else
>     do_something();

Oh, yes, sorry. I wrote and rewrote should_look_around() quite a few
times while trying to figure out what made sense in a no-batching
series. I'll fix this.

>
> > I made this logic change as part of removing batching.
> >
> > I'd really appreciate guidance on what the correct thing to do is.
> >
> > In my mind, what would work great is: by default, do aging exactly
> > when KVM can do it locklessly, and then have a Kconfig to always have
> > MGLRU to do aging with KVM if a user really cares about proactive
> > reclaim (when the feature bit is set). The selftest can check the
> > Kconfig + feature bit to know for sure if aging will be done.
>
> I still don't see how that Kconfig helps. Or why the new static branch
> isn't enough?

Without a special Kconfig, the feature bit just tells us that aging
with KVM is possible, not that it will necessarily be done. For the
self-test, it'd be good to know exactly when aging is being done or
not, so having a Kconfig like LRU_GEN_ALWAYS_WALK_SECONDARY_MMU would
help make the self-test set the right expectations for aging.

The Kconfig would also allow a user to know that, no matter what,
we're going to get correct age data for VMs, even if, say, we're using
the shadow MMU. This is somewhat important for me/Google Cloud. Is
that reasonable? Maybe there's a better solution.