Message-ID: <CAGudoHEwfYpmahzg1NsurZWe5Of-kwX3JJaWvm=LA4_rC-CdKQ@mail.gmail.com>
Date: Thu, 24 Apr 2025 18:03:11 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: "Christoph Lameter (Ampere)" <cl@...two.org>
Cc: Harry Yoo <harry.yoo@...cle.com>, Vlastimil Babka <vbabka@...e.cz>,
David Rientjes <rientjes@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>,
Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
Vlad Buslov <vladbu@...dia.com>, Yevgeny Kliteynik <kliteyn@...dia.com>, Jan Kara <jack@...e.cz>,
Byungchul Park <byungchul@...com>, linux-mm@...ck.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu
allocator scalability problem
On Thu, Apr 24, 2025 at 5:50 PM Christoph Lameter (Ampere)
<cl@...two.org> wrote:
>
> On Thu, 24 Apr 2025, Harry Yoo wrote:
>
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
>
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
>
> Could you keep a cache of recently used per cpu regions so that you can
> avoid frequent percpu allocation operation?
>
> You could allocate larger percpu areas for a batch of them and
> then assign as needed.
I was considering a mechanism like that earlier, but the changes
needed to make it happen would leave the alloc/free path in a worse
state.
RSS counters are embedded into the mm, with only the per-CPU areas
being pointers. The machinery maintains a global list of all counter
instances, i.e., of the counter structs embedded in each mm_struct.
That is to say, even if you made allocation of the percpu memory
itself scalable, you would still globally serialize on adding the
counters to and removing them from that list.
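
To make the shape of the problem concrete, here is a minimal
user-space model (illustrative only; the names are made up, the
kernel analogue being the percpu counter machinery):

#include <pthread.h>
#include <stdlib.h>

/* Illustrative stand-ins only -- not kernel code. */
struct counter {
	long *pcpu;             /* the separately allocated "per-CPU" data */
	struct counter *next;   /* linkage for the global instance list */
};

static pthread_mutex_t counters_lock = PTHREAD_MUTEX_INITIALIZER;
static struct counter *all_counters;    /* every live counter, system-wide */

static void counter_init(struct counter *c, int ncpus)
{
	c->pcpu = calloc(ncpus, sizeof(long));   /* the "percpu" allocation */

	/*
	 * Even if the allocation above scaled perfectly, every init still
	 * funnels through this one lock to register the instance, and
	 * every destroy funnels through it again to unlink.
	 */
	pthread_mutex_lock(&counters_lock);
	c->next = all_counters;
	all_counters = c;
	pthread_mutex_unlock(&counters_lock);
}

The destroy side mirrors this with an unlink under the same lock.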
But suppose this got reworked somehow and this bit ceased to be a problem.
Another spot where mm alloc/free globally serializes (at least on
x86_64) is pgd_alloc/pgd_free taking the global pgd_lock.
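
From memory, the allocation side looks roughly like this (heavily
simplified from arch/x86/mm/pgtable.c; error handling and
prepopulation details elided):

pgd_t *pgd_alloc(struct mm_struct *mm)
{
	...
	spin_lock(&pgd_lock);           /* one lock shared by every mm */
	pgd_ctor(mm, pgd);              /* links the pgd into the global pgd_list */
	pgd_prepopulate_pmd(mm, pgd, pmds);
	spin_unlock(&pgd_lock);
	...
}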
Suppose you managed to decompose that lock into finer granularity, to
the point where it no longer poses a problem from a contention
standpoint. Even then, that is work which does not have to happen
there in the first place.
The general theme is that there is a lot of expensive work happening
in the mm lifecycle (*both* from a single- and a multi-threaded
standpoint), and preferably it would be paid only once per object's
existence.
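
Which is roughly what a ctor/dtor scheme buys. A sketch of the idea
(illustrative only; the names are made up and this is not the actual
RFC implementation):

/*
 * "Pay the expensive setup once per object existence": ctor runs when
 * backing memory is first handed out, dtor only when it is reclaimed.
 */
struct cache {
	void (*ctor)(void *obj);   /* runs when backing memory is first used */
	void (*dtor)(void *obj);   /* runs when backing pages are freed */
};

/* Hypothetical helpers, declared only to keep the sketch self-contained. */
void *freelist_pop(struct cache *c);
void freelist_push(struct cache *c, void *obj);
void *grab_new_backing_object(struct cache *c);

void *cache_alloc(struct cache *c)
{
	void *obj = freelist_pop(c);

	if (!obj) {
		obj = grab_new_backing_object(c);
		c->ctor(obj);   /* percpu regions, pgd setup, etc. happen here */
	}
	return obj;             /* fast path: no percpu alloc, no global locks */
}

void cache_free(struct cache *c, void *obj)
{
	/*
	 * The expensive state is retained with the object; ->dtor only
	 * runs when the backing slab pages are reclaimed, not here.
	 */
	freelist_push(c, obj);
}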
--
Mateusz Guzik <mjguzik gmail.com>