Message-ID: <cd7de95e-96b6-b957-2889-bf53d0a019e2@gentwo.org>
Date: Thu, 24 Apr 2025 09:39:08 -0700 (PDT)
From: "Christoph Lameter (Ampere)" <cl@...two.org>
To: Mateusz Guzik <mjguzik@...il.com>
cc: Harry Yoo <harry.yoo@...cle.com>, Vlastimil Babka <vbabka@...e.cz>,
David Rientjes <rientjes@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>, Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>, Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
Vlad Buslov <vladbu@...dia.com>, Yevgeny Kliteynik <kliteyn@...dia.com>,
Jan Kara <jack@...e.cz>, Byungchul Park <byungchul@...com>,
linux-mm@...ck.org, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu
allocator scalability problem
On Thu, 24 Apr 2025, Mateusz Guzik wrote:
> > You could allocate larger percpu areas for a batch of them and
> > then assign as needed.
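
(For concreteness, the suggestion quoted above might look roughly like the
sketch below: one larger percpu allocation carved into per-mm slots that get
handed out as mm_structs are created. All names, the batch size and the
locking are invented here for illustration, not code from the tree.)

#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/bitmap.h>
#include <linux/mm_types.h>	/* NR_MM_COUNTERS */

#define RSS_SLOTS	32	/* mm_structs served per batch (made up) */

struct mm_rss_batch {
	s64 __percpu	*area;	/* RSS_SLOTS * NR_MM_COUNTERS s64s per cpu */
	DECLARE_BITMAP(used, RSS_SLOTS);
	spinlock_t	lock;
};

static struct mm_rss_batch *rss_batch_create(gfp_t gfp)
{
	struct mm_rss_batch *b = kzalloc(sizeof(*b), gfp);

	if (!b)
		return NULL;
	/* one percpu allocation covering a whole batch of future mm_structs */
	b->area = __alloc_percpu_gfp(RSS_SLOTS * NR_MM_COUNTERS * sizeof(s64),
				     __alignof__(s64), gfp);
	if (!b->area) {
		kfree(b);
		return NULL;
	}
	spin_lock_init(&b->lock);
	return b;
}

static s64 __percpu *rss_batch_get_slot(struct mm_rss_batch *b)
{
	int slot;

	spin_lock(&b->lock);
	slot = find_first_zero_bit(b->used, RSS_SLOTS);
	if (slot < RSS_SLOTS)
		__set_bit(slot, b->used);
	spin_unlock(&b->lock);

	if (slot >= RSS_SLOTS)
		return NULL;	/* batch full, caller grabs/creates another */
	return b->area + slot * NR_MM_COUNTERS;	/* percpu pointer arithmetic */
}
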
>
> I was considering a mechanism like that earlier, but the changes
> needed to make it happen would result in worse state for the
> alloc/free path.
>
> RSS counters are embedded into mm with only the per-cpu areas being a
> pointer. The machinery maintains a global list of all of their
> instances, i.e. pointers into the mm_structs themselves. That is to say
> even if you deserialized allocation of percpu memory itself, you would
> still globally serialize on adding/removing the counters to the global
> list.
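
A simplified sketch of the serialization being described, paraphrasing
lib/percpu_counter.c from memory (error paths and the CONFIG_HOTPLUG_CPU
conditionals are trimmed): every counter init puts the counter on a single
global list under a single global lock, so concurrent mm_struct creation
funnels through it no matter how scalable the percpu allocation itself
becomes.

#include <linux/percpu_counter.h>
#include <linux/spinlock.h>
#include <linux/list.h>

static LIST_HEAD(percpu_counters);		/* one list for the whole system */
static DEFINE_SPINLOCK(percpu_counters_lock);	/* one lock guarding it */

static int sketch_percpu_counter_init(struct percpu_counter *fbc, s64 amount,
				      gfp_t gfp)
{
	unsigned long flags;

	raw_spin_lock_init(&fbc->lock);
	fbc->count = amount;
	fbc->counters = alloc_percpu_gfp(s32, gfp);	/* this part can scale */
	if (!fbc->counters)
		return -ENOMEM;

	/* ...but this part cannot: the global serialization point */
	spin_lock_irqsave(&percpu_counters_lock, flags);
	list_add(&fbc->list, &percpu_counters);
	spin_unlock_irqrestore(&percpu_counters_lock, flags);
	return 0;
}
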
>
> But suppose this got reworked somehow and this bit ceases to be a problem.
>
> Another spot where mm alloc/free globally serializes (at least on
> x86_64) is pgd_alloc/free on the global pgd_lock.
>
> Suppose you managed to decompose the lock into a finer granularity, to
> the point where it does not pose a problem from contention standpoint.
> Even then that's work which does not have to happen there.
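
Same kind of simplified view of the pgd_lock path mentioned above,
paraphrasing arch/x86/mm/pgtable.c from memory (PTI, 32-bit PAE and error
handling dropped): every pgd allocation takes one machine-wide lock so the
new pgd can be put on the global pgd_list used to keep kernel mappings in
sync.

#include <linux/mm.h>
#include <asm/pgalloc.h>

extern spinlock_t pgd_lock;		/* one lock for the whole machine */
extern struct list_head pgd_list;	/* every pgd in the system */

static void sketch_pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
	/* copy the kernel half of the page tables, then publish the pgd */
	clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
			swapper_pg_dir + KERNEL_PGD_BOUNDARY,
			KERNEL_PGD_PTRS);
	list_add(&virt_to_page(pgd)->lru, &pgd_list);
}

pgd_t *sketch_pgd_alloc(struct mm_struct *mm)
{
	pgd_t *pgd = (pgd_t *)get_zeroed_page(GFP_PGTABLE_USER);

	if (!pgd)
		return NULL;

	spin_lock(&pgd_lock);		/* the globally serialized part */
	sketch_pgd_ctor(mm, pgd);
	spin_unlock(&pgd_lock);

	return pgd;
}
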
>
> General theme is there is a lot of expensive work happening when
> dealing with mm lifecycle (*both* from single- and multi-threaded
> standpoint) and preferably it would only be dealt with once per
> object's existence.
Maybe change the lifecycle? Allocate a batch of entries from the slab
allocator up front and use them for multiple mm_structs as the need
arises. Do not free them back to the slab allocator until you have too
many of them sitting around doing nothing?
You may also be able to avoid counter updates with this scheme if you only
count the batches used. It will become a bit fuzzy, but it improves scalability.
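
A hypothetical sketch of what that could look like: a pool of entries whose
per-cpu counters are already set up (i.e. the globally serialized init has
already been paid for), handed out to new mm_structs and only given back to
the slab allocator once too many sit idle. Accounting would then be per
batch rather than per mm, which is where the fuzziness comes in. All names
and limits below are invented for illustration.

#include <linux/percpu_counter.h>
#include <linux/mm_types.h>	/* NR_MM_COUNTERS */
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/list.h>

#define POOL_HIGH	64	/* trim the pool above this (made up) */

struct mm_counter_entry {
	struct percpu_counter	rss[NR_MM_COUNTERS];
	struct list_head	pool_node;
};

static LIST_HEAD(entry_pool);
static DEFINE_SPINLOCK(entry_pool_lock);	/* could be made per-cpu */
static int entry_pool_nr;

static struct mm_counter_entry *entry_alloc_slow(void)
{
	struct mm_counter_entry *e = kzalloc(sizeof(*e), GFP_KERNEL);

	if (e && percpu_counter_init_many(e->rss, 0, GFP_KERNEL, NR_MM_COUNTERS)) {
		kfree(e);
		e = NULL;
	}
	return e;
}

static struct mm_counter_entry *entry_get(void)
{
	struct mm_counter_entry *e = NULL;

	spin_lock(&entry_pool_lock);
	if (!list_empty(&entry_pool)) {
		e = list_first_entry(&entry_pool, struct mm_counter_entry, pool_node);
		list_del(&e->pool_node);
		entry_pool_nr--;
	}
	spin_unlock(&entry_pool_lock);

	/* pay the percpu/global-list cost only when the pool runs dry */
	return e ?: entry_alloc_slow();
}

static void entry_put(struct mm_counter_entry *e)
{
	spin_lock(&entry_pool_lock);
	if (entry_pool_nr < POOL_HIGH) {
		/* keep it initialized for the next mm; counts stay fuzzy */
		list_add(&e->pool_node, &entry_pool);
		entry_pool_nr++;
		e = NULL;
	}
	spin_unlock(&entry_pool_lock);

	if (e) {
		percpu_counter_destroy_many(e->rss, NR_MM_COUNTERS);
		kfree(e);
	}
}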