linux-kernel - Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

Message-ID: <Zl5LqcusZ88QOGQY@google.com>
Date: Mon, 3 Jun 2024 16:03:05 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: James Houghton <jthoughton@...gle.com>
Cc: Yu Zhao <yuzhao@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Paolo Bonzini <pbonzini@...hat.com>, Albert Ou <aou@...s.berkeley.edu>, 
	Ankit Agrawal <ankita@...dia.com>, Anup Patel <anup@...infault.org>, 
	Atish Patra <atishp@...shpatra.org>, Axel Rasmussen <axelrasmussen@...gle.com>, 
	Bibo Mao <maobibo@...ngson.cn>, Catalin Marinas <catalin.marinas@....com>, 
	David Matlack <dmatlack@...gle.com>, David Rientjes <rientjes@...gle.com>, 
	Huacai Chen <chenhuacai@...nel.org>, James Morse <james.morse@....com>, 
	Jonathan Corbet <corbet@....net>, Marc Zyngier <maz@...nel.org>, Michael Ellerman <mpe@...erman.id.au>, 
	Nicholas Piggin <npiggin@...il.com>, Oliver Upton <oliver.upton@...ux.dev>, 
	Palmer Dabbelt <palmer@...belt.com>, Paul Walmsley <paul.walmsley@...ive.com>, 
	Raghavendra Rao Ananta <rananta@...gle.com>, Ryan Roberts <ryan.roberts@....com>, 
	Shaoqin Huang <shahuang@...hat.com>, Shuah Khan <shuah@...nel.org>, 
	Suzuki K Poulose <suzuki.poulose@....com>, Tianrui Zhao <zhaotianrui@...ngson.cn>, 
	Will Deacon <will@...nel.org>, Zenghui Yu <yuzenghui@...wei.com>, kvm-riscv@...ts.infradead.org, 
	kvm@...r.kernel.org, kvmarm@...ts.linux.dev, 
	linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org, 
	linux-mips@...r.kernel.org, linux-mm@...ck.org, 
	linux-riscv@...ts.infradead.org, linuxppc-dev@...ts.ozlabs.org, 
	loongarch@...ts.linux.dev
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate
 in aging

On Mon, Jun 03, 2024, James Houghton wrote:
> On Thu, May 30, 2024 at 11:06 PM Yu Zhao <yuzhao@...gle.com> wrote:
> > What I don't think is acceptable is simplifying those optimizations
> > out without documenting your justifications (I would even call it a
> > design change, rather than simplification, from v3 to v4).
> 
> I'll put back something similar to what you had before (like a
> test_clear_young() with a "fast" parameter instead of "bitmap"). I
> like the idea of having a new mmu notifier, like
> fast_test_clear_young(), while leaving test_young() and clear_young()
> unchanged (where "fast" means "prioritize speed over accuracy").

Those two statements are contradicting each other, aren't they?  Anyways, I vote
for a "fast only" variant, e.g. test_clear_young_fast_only() or so.  gup() has
already established that terminology in mm/, so hopefully it would be familiar
to readers.  We could pass a param, but then the MGLRU code would likely end up
doing a bunch of useless indirect calls into secondary MMUs, whereas a dedicated
hook allows implementations to nullify the pointer if the API isn't supported
for whatever reason.
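
To illustrate the dedicated-hook point, here is a minimal userspace sketch, not
kernel code.  The struct, the hook name test_clear_young_fast_only() (taken from
the suggestion above) and the helpers are all hypothetical; the sketch only shows
that a caller can skip the indirect call entirely when the pointer is NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical ops table, loosely inspired by the shape of an mmu notifier
 * ops struct.  None of these names are an existing kernel API.
 */
struct secondary_mmu_ops {
	/* Accurate variant, may be slow. */
	bool (*test_clear_young)(unsigned long start, unsigned long end);
	/* Optional: left NULL when no fast path is available. */
	bool (*test_clear_young_fast_only)(unsigned long start, unsigned long end);
};

/* Counts how often the demo fast hook was actually invoked. */
static int fast_calls;

/* Demo implementation: pretend every non-empty range was young. */
static bool demo_fast_only(unsigned long start, unsigned long end)
{
	fast_calls++;
	return start < end;
}

/*
 * Caller side: a NULL hook means "not supported", so no indirect call is
 * made and the range is simply reported as not young.
 */
static bool age_range_fast(const struct secondary_mmu_ops *ops,
			   unsigned long start, unsigned long end)
{
	if (!ops->test_clear_young_fast_only)
		return false;

	return ops->test_clear_young_fast_only(start, end);
}
```

With a single hook plus a "fast" parameter, the caller would have to make the
indirect call just to learn that the fast path is unsupported.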

And pulling in Oliver's comments about locking, I think it's important that the
mmu_notifier API express its requirement that the operation be "fast", not that
it be lockless.  E.g. if a secondary MMU can guarantee that a lock will be
contended only in rare, slow cases, then taking a lock is a-ok.  Or a secondary
MMU could do try-lock and bail if the lock is contended.
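
A rough userspace sketch of the "try-lock and bail" option, under stated
assumptions: a C11 atomic_flag stands in for whatever lock a secondary MMU
would really use, and struct demo_mmu and both helpers are hypothetical names,
not KVM code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical secondary MMU state: one lock and one "accessed" bit. */
struct demo_mmu {
	atomic_flag lock;	/* stand-in for the real MMU lock */
	bool young;
};

/* Returns true if the lock was acquired, false if someone else holds it. */
static bool demo_trylock(struct demo_mmu *mmu)
{
	return !atomic_flag_test_and_set_explicit(&mmu->lock,
						  memory_order_acquire);
}

static void demo_unlock(struct demo_mmu *mmu)
{
	atomic_flag_clear_explicit(&mmu->lock, memory_order_release);
}

/*
 * "Fast only" aging: never wait for the lock.  If it is contended, set
 * *skipped and report "not young" rather than blocking the caller.
 */
static bool demo_test_clear_young_fast_only(struct demo_mmu *mmu, bool *skipped)
{
	bool young;

	if (!demo_trylock(mmu)) {
		*skipped = true;
		return false;
	}

	*skipped = false;
	young = mmu->young;
	mmu->young = false;
	demo_unlock(mmu);

	return young;
}
```

The cost of bailing is accuracy: a contended call leaves the accessed state
untouched, so the page may look older than it really is until the next pass.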

That way KVM can honor the intent of the API with an implementation that works
best for KVM _and_ for MGLRU.  I'm sure there will be future adjustments and fixes,
but that's just more motivation for using something like "fast only" instead of
"lockless".

> > > I made this logic change as part of removing batching.
> > >
> > > I'd really appreciate guidance on what the correct thing to do is.
> > >
> > > In my mind, what would work great is: by default, do aging exactly
> > > when KVM can do it locklessly, and then have a Kconfig to always have
> > > MGLRU to do aging with KVM if a user really cares about proactive
> > > reclaim (when the feature bit is set). The selftest can check the
> > > Kconfig + feature bit to know for sure if aging will be done.
> >
> > I still don't see how that Kconfig helps. Or why the new static branch
> > isn't enough?
> 
> Without a special Kconfig, the feature bit just tells us that aging
> with KVM is possible, not that it will necessarily be done. For the
> self-test, it'd be good to know exactly when aging is being done or
> not, so having a Kconfig like LRU_GEN_ALWAYS_WALK_SECONDARY_MMU would
> help make the self-test set the right expectations for aging.
> 
> The Kconfig would also allow a user to know that, no matter what,
> we're going to get correct age data for VMs, even if, say, we're using
> the shadow MMU.

Heh, unless KVM flushes, you won't get "correct" age data.

> This is somewhat important for me/Google Cloud. Is that reasonable? Maybe
> there's a better solution.

Hmm, no?  There's no reason to use a Kconfig, e.g. if we _really_ want to prioritize
accuracy over speed, then a KVM (x86?) module param to have KVM walk nested TDP
page tables would give us what we want.

But before we do that, I think we need to perform due diligence (or provide data)
showing that having KVM take mmu_lock for write in the "fast only" API provides
better total behavior.  I.e. that the additional accuracy is indeed worth the cost.