Message-ID: <CAGudoHF4Tj+LcMSEZK3H3TF0Xc2xzf_NWro3ghXddOwVHNRTuw@mail.gmail.com>
Date: Wed, 30 Apr 2025 21:49:02 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Pedro Falcato <pfalcato@...e.de>
Cc: Harry Yoo <harry.yoo@...cle.com>, Vlastimil Babka <vbabka@...e.cz>, 
	Christoph Lameter <cl@...two.org>, David Rientjes <rientjes@...gle.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, 
	Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>, 
	Jiri Pirko <jiri@...nulli.us>, Vlad Buslov <vladbu@...dia.com>, 
	Yevgeny Kliteynik <kliteyn@...dia.com>, Jan Kara <jack@...e.cz>, Byungchul Park <byungchul@...com>, 
	linux-mm@...ck.org, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu
 allocator scalability problem

On Fri, Apr 25, 2025 at 12:42 PM Pedro Falcato <pfalcato@...e.de> wrote:
> With regards to "leaf locks", I still don't really understand what you/Mateusz
> mean or how that's even enforceable from the get-go.
>
> So basically:
> - ->ctor takes more args, can fail, can do fancier things (multiple allocations,
>   lock holding, etc, can be hidden with a normal kmem_cache_alloc; certain
>   caches become GFP_ATOMIC-incompatible)
>
> - ->dtor *will* do fancy things like recursing back onto the slab allocator and
>   grabbing locks
>
> - a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
>   to dispose of slabs. It can also uncontrollably do $whatever.
>
> - a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
>   ->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
>   cache_free/slab disposal issues
>
> - a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
>   issues by purely starting up shrinkers on direct reclaim as well.
>
> - the whole original "Slab object caching allocator" idea from 1992 is extremely
>   confusing and works super poorly with various debugging features (like, e.g,
>   KASAN). IMO it should really be reserved (in a limited capacity!) for stuff like
>   TYPESAFE_BY_RCU, that we *really* need.
>
> These are basically my issues with the whole idea. I highly disagree that we should
> open this pandora's box for problems in *other places*.

Apologies for the late reply.

Looks like your primary apprehension concerns the dtor; I'm going to
address it below.

But first, a quick remark: the headline problem here is expensive
single-threaded work which keeps happening on every mm alloc/free and
which does not have to, and it extends beyond the percpu allocator.
Having a memory allocator which can handle it would be most welcome.

Now to business:
I'm going to start by pointing out that dtors callable from any
place are not an *inherent* requirement of the idea. Given that
apparently sheaves don't do direct reclaim and that Christoph's idea
does not do it either, I think there is some support for objs with
unsafe dtors *not* being directly reclaimable (instead there can be a
dedicated workqueue or some other mechanism sorting them out). I did
not realize something like this would be considered fine. It is the
easiest way out and is perfectly fine with me.

However, suppose objs with dtors do need to be reclaimable the usual way.

I claim that writing dtors which are safe to use in that context is
not a significant challenge. Moreover, it is also possible to extend
lockdep to validate correct behavior. Finally, test code can trigger
ctor and dtor calls for all slabs to execute all this code at least
once with lockdep enabled. So while *honest* mistakes with locking are
very much possible, they will be trivially caught and I don't believe
the box which is being opened here belongs to Pandora.

So here is another attempt at explaining leaf spinlocks.

Suppose you have a global lock named "crapper". Further suppose the
lock is only taken with _irqsave *and* no locks are taken while
holding it.

Say this is the only consumer:
void crapperbump(void) {
    unsigned long flags;

    spin_lock_irqsave(&crapper, flags);
    mehvar++;
    spin_unlock_irqrestore(&crapper, flags);
}

Perhaps you can agree *anyone* can call here at any point and not risk
deadlocking.

That's an example of a leaf lock.

Aight, so how does one combat cases where the code turns into:
spin_lock_irqsave(&crapper, flags);
spin_lock_irqsave(&meh, flags2);

In this case "crapper" is no longer a leaf lock and in principle there
may be lock ordering involving "meh" which does deadlock.

Here is an example way out: when initializing the "crapper" lock, it
gets marked as a leaf lock so that lockdep can check for it. Then on a
lockdep-enabled kernel you get a splat as soon as the routine reaches
the acquisition of "meh". This sorts out the ctor side of things.

How does one validate dtor? lockdep can have a "only leaf-locks
allowed in this area" tunable around calls to dtors. Then should a
dtor have ideas of acquiring a lock which is not a leaf lock you are
once more going to get a splat.
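The dtor-side tunable can be sketched the same way (again all names
are hypothetical, not real lockdep interfaces): a per-thread flag is
set around the dtor call, and any acquisition of a lock not marked as
leaf inside that window is reported.

```c
#include <stdbool.h>

/* Per-thread state: are we inside a "leaf locks only" section (i.e. a
 * dtor call), and how many would-be lockdep splats occurred. */
static _Thread_local bool in_dtor;
static _Thread_local int splats;

/* Bracket every dtor invocation with these. */
static void leaf_only_enter(void) { in_dtor = true; }
static void leaf_only_exit(void)  { in_dtor = false; }

/* Hook on the lock-acquire path, passed the lock's leaf marking. */
static void check_acquire(bool lock_is_leaf)
{
    if (in_dtor && !lock_is_leaf)
        splats++;           /* would be a lockdep splat */
}
```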

And of course you can just force call all ctors and dtors on a debug
kernel (no need to trigger any memory pressure, just walk the list of
slabs with ctor + dtor pairs and call the stuff).
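The debug walk itself is trivial; a userspace sketch (struct dbg_cache
and the function names are made up for illustration, not SLUB
internals): for each registered cache with a ctor/dtor pair, construct
and immediately destroy one object so the lock usage in both callbacks
runs under lockdep without any memory pressure.

```c
#include <stdlib.h>

/* Hypothetical registry of caches with ctor/dtor pairs. */
struct dbg_cache {
    const char *name;
    size_t size;
    int (*ctor)(void *obj);
    void (*dtor)(void *obj);
    struct dbg_cache *next;
};

static struct dbg_cache *dbg_caches;

/* Run every ctor and dtor once; returns how many caches were
 * exercised, or -1 on allocation failure. */
static int exercise_ctor_dtor(void)
{
    struct dbg_cache *c;
    int ran = 0;

    for (c = dbg_caches; c; c = c->next) {
        void *obj = malloc(c->size);    /* stand-in for a slab object */
        if (!obj)
            return -1;
        if (c->ctor(obj) == 0)
            c->dtor(obj);
        free(obj);
        ran++;
    }
    return ran;
}

/* Trivial demo callbacks counting their invocations. */
static int ctor_runs, dtor_runs;
static int demo_ctor(void *obj) { (void)obj; ctor_runs++; return 0; }
static void demo_dtor(void *obj) { (void)obj; dtor_runs++; }
```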

This of course would require some effort to implement, but there is no
rocket science degree required.

However, given that direct reclaim for mm is apparently not a strict
requirement, my preferred way out is to simply not provide it.
-- 
Mateusz Guzik <mjguzik gmail.com>
