Message-ID: <CAGudoHE3CgoeoCjW=qpvv0+AzsSGJQpRCMLObjnOO-uALa2xiQ@mail.gmail.com>
Date: Thu, 24 Apr 2025 18:11:31 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Pedro Falcato <pfalcato@...e.de>
Cc: Harry Yoo <harry.yoo@...cle.com>, Vlastimil Babka <vbabka@...e.cz>, 
	Christoph Lameter <cl@...two.org>, David Rientjes <rientjes@...gle.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, 
	Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>, 
	Jiri Pirko <jiri@...nulli.us>, Vlad Buslov <vladbu@...dia.com>, 
	Yevgeny Kliteynik <kliteyn@...dia.com>, Jan Kara <jack@...e.cz>, Byungchul Park <byungchul@...com>, 
	linux-mm@...ck.org, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu
 allocator scalability problem

On Thu, Apr 24, 2025 at 5:20 PM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> On Thu, Apr 24, 2025 at 1:28 PM Pedro Falcato <pfalcato@...e.de> wrote:
> > > How to do this with slab constructors and destructors: the constructor
> > > allocates percpu memory, and the destructor frees it when the slab pages
> > > are reclaimed; this slightly alters the constructor’s semantics,
> > > as it can now fail.
> > >
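
[ For illustration only: a minimal sketch of what such a ctor/dtor pair
  could look like. struct foo, the int-returning ctor and the dtor hook
  are assumptions of mine mirroring the description above, not code
  from the series. ]

#include <linux/percpu.h>
#include <linux/gfp.h>

struct foo {
	int __percpu *stats;	/* hypothetical per-object percpu state */
};

/* The ctor now allocates percpu memory and can therefore fail. */
static int foo_ctor(void *obj)
{
	struct foo *f = obj;

	f->stats = alloc_percpu_gfp(int, GFP_KERNEL);
	if (!f->stats)
		return -ENOMEM;
	return 0;
}

/* The dtor runs only when the backing slab pages get reclaimed. */
static void foo_dtor(void *obj)
{
	struct foo *f = obj;

	free_percpu(f->stats);
}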
> >
> > I really really really really don't like this. We're opening a Pandora's box
> > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > solution there would be, what, failing dtors? which says a lot about the whole
> > situation...
> >
>
> I noted the need to use leaf spin locks in my IRC conversations with
> Harry and later in this very thread; it is a bummer this bit did not
> make it into the cover letter -- hopefully it would have avoided this
> exchange.
>
> I'm going to summarize this again here:
>
> By API contract the dtor can only take a leaf spinlock, in this case one which:
> 1. disables irqs
> 2. is the last lock in the dependency chain, as in no locks are taken
> while holding it
>
> That way there is no possibility of a deadlock.
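
[ Concretely, a dtor restricted to such a leaf lock would look roughly
  like this; struct foo, foo_list and foo_list_lock are made-up
  examples of mine, not anything from the series. ]

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/percpu.h>

struct foo {
	struct list_head node;		/* linkage on foo_list */
	int __percpu *stats;
};

static LIST_HEAD(foo_list);
/* Leaf lock: taken with irqs disabled, no other locks nest inside it. */
static DEFINE_SPINLOCK(foo_list_lock);

static void foo_dtor(void *obj)
{
	struct foo *f = obj;
	unsigned long flags;

	/* No allocation, no sleeping, no other locks while this is held. */
	spin_lock_irqsave(&foo_list_lock, flags);
	list_del(&f->node);
	spin_unlock_irqrestore(&foo_list_lock, flags);

	free_percpu(f->stats);
}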
>
> This poses the question of how to enforce it, and this bit is easy: for
> example, one can add a leaf-spinlock notion to lockdep. Then a misuse on
> the allocation side will make lockdep complain immediately, even without
> triggering reclaim. Further, if one were so inclined, a test
> module could walk the list of all slab caches and do a populate/reclaim
> cycle on those with a ctor/dtor pair.
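
[ A sketch of such an exercise; it pokes one known cache rather than
  walking the internal slab_caches list (which is not exported), and
  the leaf-lock lockdep annotation it relies on does not exist today. ]

#include <linux/slab.h>
#include <linux/mm.h>

/* Allocate a pile of objects, free them, then reclaim the empty slabs. */
static int exercise_cache(struct kmem_cache *cache, int count)
{
	void **objs;
	int i;

	objs = kvmalloc_array(count, sizeof(void *), GFP_KERNEL);
	if (!objs)
		return -ENOMEM;

	for (i = 0; i < count; i++) {
		objs[i] = kmem_cache_alloc(cache, GFP_KERNEL);
		if (!objs[i])
			break;
	}

	while (i--)
		kmem_cache_free(cache, objs[i]);

	/*
	 * Shrinking releases the now-empty slabs, which is what would run
	 * the dtors; with a leaf-lock annotation lockdep would then flag
	 * any dtor taking a lock it is not supposed to.
	 */
	kmem_cache_shrink(cache);

	kvfree(objs);
	return 0;
}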
>
> Then there is the matter of whether particular consumers can do what
> they need to on the dtor side while holding only that spinlock. Does
> not sound like a fundamental problem.
>
> > Case in point:
> > What happens if you allocate a slab and start ->ctor()-ing objects, and then
> > one of the ctors fails? We need to free the slab, but not without ->dtor()-ing
> > everything back (AIUI this is not handled in this series, yet). Besides this
> > complication, if failing dtors were added into the mix, we'd be left with a
> > half-initialized slab(!!) in the middle of the cache waiting to get freed,
> > without being able to.
> >
>
> Per my previous paragraph, failing dtors would be a self-induced problem.
>
> I can agree one has to roll things back if ctors don't work out, but I
> don't think this poses a significant problem.
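
[ For reference, the rollback boils down to something like the below;
  this is a pseudocode-level sketch assuming the series' failing ctor
  plus a dtor hook, not actual SLUB code. ]

/* Construct all objects of a new slab, undoing on partial failure. */
static int construct_objects(struct kmem_cache *s, void *objects[], int nr)
{
	int i, err;

	for (i = 0; i < nr; i++) {
		err = s->ctor(objects[i]);
		if (err)
			goto rollback;
	}
	return 0;

rollback:
	/* Destroy only the objects that were successfully constructed. */
	while (i--)
		s->dtor(objects[i]);
	return err;	/* the caller then frees the pristine slab page */
}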
>
> > Then there are obviously other problems like: whatever you're calling must
> > not ever require the slab allocator (directly or indirectly) and must not
> > do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> > is a no-go (AIUI!) already because of such issues.
> >
>
> I don't see how that's true.
>
> > Then there's the separate (but adjacent, particularly as we're considering
> > this series due to performance improvements) issue that the ctor() and
> > dtor() interfaces are terrible, in the sense that they do not let you batch
> > in any way shape or form (requiring us to lock/unlock many times, allocate
> > many times, etc). If this is done for performance improvements, I would prefer
> > a superior ctor/dtor interface that takes something like a slab iterator and
> > lets you do these things.
> >
>
> Batching this is also something I mentioned, and indeed it is a "nice
> to have" change. Note however that the work you are suggesting to batch
> currently runs on every alloc/free cycle, so doing it once per creation
> of a given object instead is already a win.
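
[ If someone does pursue the batched variant, I would imagine an
  interface of roughly this shape; it is entirely hypothetical, no such
  API exists today, and struct foo is again a made-up example. ]

#include <linux/percpu.h>
#include <linux/gfp.h>

struct foo {
	int __percpu *stats;
};

/* All objects of one slab handed to the consumer in a single call. */
struct slab_obj_iter {
	void **objs;
	unsigned int nr;
};

static int foo_ctor_batch(struct slab_obj_iter *iter)
{
	unsigned int i;

	/*
	 * The consumer sees a whole slab's worth of objects at once and can
	 * batch its own locking; fully batching the percpu allocations would
	 * additionally want a batched pcpu alloc API.
	 */
	for (i = 0; i < iter->nr; i++) {
		struct foo *f = iter->objs[i];

		f->stats = alloc_percpu_gfp(int, GFP_KERNEL);
		if (!f->stats)
			goto undo;
	}
	return 0;

undo:
	while (i--)
		free_percpu(((struct foo *)iter->objs[i])->stats);
	return -ENOMEM;
}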
>

Whether the ctor/dtor thing lands or not, I would like to point out
the current state is quite nasty and something(tm) needs to be done.

The mm object is allocated from a per-cpu cache, only to have the
mandatory initialization globally serialize *several* times, including
twice on the same lock. This is so bad that performance would be
better with a single globally-locked cache of mms which still hold
onto their per-cpu memory and so on.

Or to put it differently, the existence of a per-cpu cache of mm
objects is defeated by the global locking endured on each alloc/free
cycle (and this goes beyond the percpu memory allocations for
counters).

So another idea would be to instead create a cache with *some*
granularity (say one instance per 4 or 8 CPU threads). Note this
should reduce the total number of mms allocated but unused in the
system. If the mms sitting there were still fully populated as in this
patchset, perhaps the reduction in "wasted" objects would be
sufficient to justify ignoring direct reclaim; if need be, they could
instead be reclaimed from a dedicated thread (whatever the Linux
equivalent is).
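
Very roughly what I have in mind, as a sketch (names, sizes and the
fallback/reclaim glue are all made up; a real version would presumably
plug into the shrinker machinery rather than hand-roll a thread):

#include <linux/mm_types.h>
#include <linux/spinlock.h>
#include <linux/smp.h>
#include <linux/cache.h>

#define MM_CACHE_GROUP_SIZE	8	/* cpu threads per cache instance */
#define MM_CACHE_BATCH		16	/* mms parked per instance */

struct mm_cache_shard {
	spinlock_t lock;
	unsigned int nr;
	struct mm_struct *mms[MM_CACHE_BATCH];	/* fully set up, pcpu memory included */
} ____cacheline_aligned_in_smp;

/* One entry per MM_CACHE_GROUP_SIZE cpus, allocated at boot (not shown). */
static struct mm_cache_shard *mm_cache_shards;

static struct mm_cache_shard *this_cpu_shard(void)
{
	return &mm_cache_shards[raw_smp_processor_id() / MM_CACHE_GROUP_SIZE];
}

static struct mm_struct *mm_cache_get(void)
{
	struct mm_cache_shard *shard = this_cpu_shard();
	struct mm_struct *mm = NULL;

	spin_lock(&shard->lock);
	if (shard->nr)
		mm = shard->mms[--shard->nr];
	spin_unlock(&shard->lock);
	return mm;	/* NULL: fall back to the full allocation path */
}

static bool mm_cache_put(struct mm_struct *mm)
{
	struct mm_cache_shard *shard = this_cpu_shard();
	bool cached = false;

	spin_lock(&shard->lock);
	if (shard->nr < MM_CACHE_BATCH) {
		shard->mms[shard->nr++] = mm;
		cached = true;
	}
	spin_unlock(&shard->lock);
	return cached;	/* false: the caller tears the mm down for real */
}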

-- 
Mateusz Guzik <mjguzik gmail.com>
