lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date: Fri, 28 Jun 2024 16:38:57 -0700
From: James Houghton <jthoughton@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Yu Zhao <yuzhao@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Paolo Bonzini <pbonzini@...hat.com>, Ankit Agrawal <ankita@...dia.com>, 
	Axel Rasmussen <axelrasmussen@...gle.com>, Catalin Marinas <catalin.marinas@....com>, 
	David Matlack <dmatlack@...gle.com>, David Rientjes <rientjes@...gle.com>, 
	James Morse <james.morse@....com>, Jonathan Corbet <corbet@....net>, Marc Zyngier <maz@...nel.org>, 
	Oliver Upton <oliver.upton@...ux.dev>, Raghavendra Rao Ananta <rananta@...gle.com>, 
	Ryan Roberts <ryan.roberts@....com>, Shaoqin Huang <shahuang@...hat.com>, 
	Suzuki K Poulose <suzuki.poulose@....com>, Wei Xu <weixugc@...gle.com>, 
	Will Deacon <will@...nel.org>, Zenghui Yu <yuzenghui@...wei.com>, kvmarm@...ts.linux.dev, 
	kvm@...r.kernel.org, linux-arm-kernel@...ts.infradead.org, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Mon, Jun 17, 2024, James Houghton wrote:
> > On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson <seanjc@...gle.com> wrote:
> > > Ooh!  Actually, after fiddling a bit to see how feasible fast-aging in the shadow
> > > MMU would be, I'm pretty sure we can do straight there for nested TDP.  Or rather,
> > > I suspect/hope we can get close enough for an initial merge, which would allow
> > > aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things
> > > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification.
> > >
> > > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done
> > > in such a way that a lockless walk would be painfully complex.  But if there is
> > > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at
> > > that one SPTE.  And with nested TDP, unless L1 is doing something uncommon, e.g.
> > > mapping the same page into multiple L2s, that overwhelming vast majority of rmaps
> > > have only one entry.  That's not the case for legacy shadow paging because kernels
> > > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map
> > > along with any userspace mappings.

Hi Sean, sorry for taking so long to get back to you.

So just to make sure I have this right: if L1 is using TDP, the gfns
in L0 will usually only be mapped by a single spte. If L1 is not using
TDP, then all bets are off. Is that true?

If that is true, given that we don't really have control over whether
or not L1 decides to use TDP, the lockless shadow MMU walk will work,
but, if L1 is not using TDP, it will often return false negatives
(says "old" for an actually-young gfn). So then I don't really
understand conditioning the lockless shadow MMU walk on us (L0) using
the TDP MMU[1]. We care about L1, right?

(Maybe you're saying that, when the TDP MMU is enabled, the only cases
where the shadow MMU is used are cases where gfns are practically
always mapped by a single shadow PTE. This isn't how I understood your
mail, but this is what your hack-a-patch[1] makes me think.)

[1] https://lore.kernel.org/linux-mm/ZmzPoW7K5GIitQ8B@google.com/

>
> ...
>
> > Hmm, interesting. I need to spend a little bit more time digesting this.
> >
> > Would you like to see this included in v6? (It'd be nice to avoid the
> > WAS_FAST stuff....) Should we leave it for a later series? I haven't
> > formed my own opinion yet.
>
> I would say it depends on the viability and complexity of my idea.  E.g. if it
> pans out more or less like my rough sketch, then it's probably worth taking on
> the extra code+complexity in KVM to avoid the whole WAS_FAST goo.
>
> Note, if we do go this route, the implementation would need to be tweaked to
> handle the difference in behavior between aging and last-minute checks for eviction,
> which I obviously didn't understand when I threw together that hack-a-patch.
>
> I need to think more about how best to handle that though, e.g. skipping GFNs with
> multiple mappings is probably the worst possible behavior, as we'd risk evicting
> hot pages.  But falling back to taking mmu_lock for write isn't all that desirable
> either.

I think falling back to the write lock is more desirable than evicting
a young page.

I've attached what I think could work, a diff on top of this series.
It builds at least. It uses rcu_read_lock/unlock() for
walk_shadow_page_lockless_begin/end(NULL), and it puts a
synchronize_rcu() in kvm_mmu_commit_zap_page().

It doesn't get rid of the WAS_FAST things because it doesn't do
exactly what [1] does. It basically makes three calls now: lockless
TDP MMU, lockless shadow MMU, locked shadow MMU. It only calls the
locked shadow MMU bits if the lockless bits say !young (instead of
being conditioned on tdp_mmu_enabled). My choice is definitely
questionable for the clear path.

Thanks!

View attachment "shadow-mmu-patch.diff" of type "text/x-patch" (6364 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ