Message-ID: <aCHUs55069p91eD7@harry>
Date: Mon, 12 May 2025 20:00:03 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Pedro Falcato <pfalcato@...e.de>, Vlastimil Babka <vbabka@...e.cz>,
        Christoph Lameter <cl@...two.org>,
        David Rientjes <rientjes@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
        Vlad Buslov <vladbu@...dia.com>,
        Yevgeny Kliteynik <kliteyn@...dia.com>, Jan Kara <jack@...e.cz>,
        Byungchul Park <byungchul@...com>, linux-mm@...ck.org,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the
 percpu allocator scalability problem

On Wed, Apr 30, 2025 at 09:49:02PM +0200, Mateusz Guzik wrote:
> On Fri, Apr 25, 2025 at 12:42 PM Pedro Falcato <pfalcato@...e.de> wrote:
> > With regards to "leaf locks", I still don't really understand what you/Mateusz
> > mean or how that's even enforceable from the get-go.
> >
> > So basically:
> > - ->ctor takes more args, can fail, can do fancier things (multiple allocations,
> >   lock holding, etc, can be hidden with a normal kmem_cache_alloc; certain
> >   caches become GFP_ATOMIC-incompatible)
> >
> > - ->dtor *will* do fancy things like recursing back onto the slab allocator and
> >   grabbing locks
> >
> > - a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
> >   to dispose of slabs. It can also uncontrollably do $whatever.
> >
> - a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
>   ->ctor. It can also set off direct reclaim, and thus run into all sorts of
>   kmem_cache_free/slab disposal issues
> >
> > - a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
> >   issues by purely starting up shrinkers on direct reclaim as well.
> >
> > - the whole original "Slab object caching allocator" idea from 1992 is extremely
>   confusing and works super poorly with various debugging features (e.g.,
> >   KASAN). IMO it should really be reserved (in a limited capacity!) for stuff like
> >   TYPESAFE_BY_RCU, that we *really* need.
> >
> > These are basically my issues with the whole idea. I highly disagree that we should
> > open this pandora's box for problems in *other places*.
> 
> Now to business:
> I'm going to start by pointing out that dtors callable from any
> place are not an *inherent* requirement of the idea. Given that
> apparently sheaves don't do direct reclaim and that Christoph's idea
> does not do it either, I think there is some support for objs with
> unsafe dtors *not* being direct reclaimable (instead there can be a
> dedicated workqueue or some other mechanism sorting them out). I did
> not realize something like this would be considered fine. It is the
> easiest way out and is perfectly fine with me.

My concern is that this prevents reclaimable slabs from using a
destructor in the future. It probably doesn't matter now, but it
doesn't seem future-proof.

> However, suppose objs with dtors do need to be reclaimable the usual way.
> 
> I claim that writing dtors which are safe to use in that context is
> not a significant challenge. Moreover, it is also possible to extend
> lockdep to validate correct behavior. Finally, test code can trigger
> ctor and dtor calls for all slabs to execute all this code at least
> once with lockdep enabled. So while *honest* mistakes with locking are
> very much possible, they will be trivially caught and I don't believe
> the box which is being opened here belongs to Pandora.
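
(A self-test like that could presumably be as simple as the sketch
below. slab_caches, slab_mutex, ctor and object_size are real slab
symbols, but the dtor field and wiring this up as a debug option are
assumptions on my part, and running the hooks on scratch memory assumes
ctors/dtors don't depend on slab placement.)

	static int __init slab_ctor_dtor_smoketest(void)
	{
		struct kmem_cache *s;

		mutex_lock(&slab_mutex);
		list_for_each_entry(s, &slab_caches, list) {
			/* scratch buffer so we don't disturb live objects */
			void *buf = kmalloc(s->object_size, GFP_KERNEL);

			if (!buf)
				continue;
			/* run each hook once so lockdep sees its locking */
			if (s->ctor)
				s->ctor(buf);
			if (s->dtor)
				s->dtor(buf);
			kfree(buf);
		}
		mutex_unlock(&slab_mutex);
		return 0;
	}
	late_initcall(slab_ctor_dtor_smoketest);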
> 
> So here is another attempt at explaining leaf spinlocks.
> 
> Suppose you have a global lock named "crapper". Further suppose the
> lock is only taken with _irqsave *and* no locks are taken while
> holding it.
> 
> Say this is the only consumer:
> void crapperbump(void) {
>     unsigned long flags;
>
>     spin_lock_irqsave(&crapper, flags);
>     mehvar++;
>     spin_unlock_irqrestore(&crapper, flags);
> }
> 
> Perhaps you can agree *anyone* can call here at any point and not risk
> deadlocking.
> 
> That's an example of a leaf lock.
> 
> Aight, so how does one combat cases where the code turns into:
> spin_lock_irqsave(&crapper, flags);
> spin_lock_irqsave(&meh, flags2);
> 
> In this case "crapper" is no longer a leaf lock and in principle there
> may be lock ordering involving "meh" which does deadlock.
> 
> Here is an example way out: when initializing the "crapper" lock, it
> gets marked as a leaf lock so that lockdep can check for it. Then on a
> lockdep-enabled kernel you get a splat when executing the routine when
> you get to locking "meh". This sorts out the ctor side of things.
> 
> How does one validate dtor? lockdep can have an "only leaf-locks
> allowed in this area" tunable around calls to dtors. Then, should a
> dtor have ideas of acquiring a lock which is not a leaf lock, you are
> once more going to get a splat.
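
(For my own understanding, the annotations would look roughly like the
sketch below. None of the lockdep leaf-lock hooks exist today; the
lockdep_set_leaf_class() / lockdep_leaf_only_*() names are purely
hypothetical.)

	/* Hypothetical lockdep extension -- none of this exists today. */

	static DEFINE_SPINLOCK(crapper);

	static void crapper_init(void)
	{
		/* mark "crapper" as a leaf lock for lockdep to enforce */
		lockdep_set_leaf_class(&crapper);
	}

	static void slab_call_dtor(struct kmem_cache *s, void *object)
	{
		/*
		 * Tell lockdep that only leaf locks may be acquired until
		 * the matching end marker; anything else splats.
		 */
		lockdep_leaf_only_begin();
		s->dtor(object);
		lockdep_leaf_only_end();
	}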

Similar to the concern above, this would limit how users can use the
destructor feature (if, say, the percpu allocator violated the rule,
although it doesn't appear to, we would have to modify it to conform).

So I think the most future-proof approach (as Vlastimil suggested
off-list) would be to reclaim slab memory and call the dtors in a
separate context (e.g., a dedicated workqueue, as you mentioned).
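
Roughly like this, I think (just made-up glue to illustrate; none of it
is existing slab internals, and a real version would avoid the
per-object allocation, e.g. by embedding the list node in slab
metadata):

	#include <linux/llist.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	struct deferred_dtor {
		struct llist_node node;
		struct kmem_cache *s;
		void *object;
	};

	static LLIST_HEAD(deferred_dtors);

	static void deferred_dtor_workfn(struct work_struct *work)
	{
		struct llist_node *list = llist_del_all(&deferred_dtors);
		struct deferred_dtor *dd, *next;

		llist_for_each_entry_safe(dd, next, list, node) {
			/* workqueue context: dtor may take non-leaf locks */
			dd->s->dtor(dd->object);
			/* then hand the destructed object back and free the
			 * slab page if it was the last one (slab-internal,
			 * omitted here) */
			kfree(dd);
		}
	}
	static DECLARE_WORK(deferred_dtor_work, deferred_dtor_workfn);

	/* called from reclaim instead of running the dtor directly */
	static void slab_defer_dtor(struct kmem_cache *s, void *object)
	{
		struct deferred_dtor *dd = kmalloc(sizeof(*dd), GFP_NOWAIT);

		if (!dd)
			return;	/* a real version needs a no-alloc fallback */
		dd->s = s;
		dd->object = object;
		llist_add(&dd->node, &deferred_dtors);
		queue_work(system_unbound_wq, &deferred_dtor_work);
	}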

But before doing this I need to check whether it's okay to increase the
number of pages reclaimed by the current task when we know the memory
will be reclaimed but hasn't actually been freed yet.

-- 
Cheers,
Harry / Hyeonggon
