Message-ID: <CAGudoHEwfYpmahzg1NsurZWe5Of-kwX3JJaWvm=LA4_rC-CdKQ@mail.gmail.com>
Date: Thu, 24 Apr 2025 18:03:11 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: "Christoph Lameter (Ampere)" <cl@...two.org>
Cc: Harry Yoo <harry.yoo@...cle.com>, Vlastimil Babka <vbabka@...e.cz>,
David Rientjes <rientjes@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>,
Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
Vlad Buslov <vladbu@...dia.com>, Yevgeny Kliteynik <kliteyn@...dia.com>, Jan Kara <jack@...e.cz>,
Byungchul Park <byungchul@...com>, linux-mm@...ck.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu
allocator scalability problem
On Thu, Apr 24, 2025 at 5:50 PM Christoph Lameter (Ampere)
<cl@...two.org> wrote:
>
> On Thu, 24 Apr 2025, Harry Yoo wrote:
>
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
>
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
>
> Could you keep a cache of recently used per cpu regions so that you can
> avoid frequent percpu allocation operation?
>
> You could allocate larger percpu areas for a batch of them and
> then assign as needed.
I was considering a mechanism like that earlier, but the changes
needed to make it happen would leave the alloc/free path in a worse
state.
RSS counters are embedded into the mm, with only the per-CPU areas
being pointers. The machinery maintains a global list of all counter
instances, i.e., of the counter structs embedded in each mm_struct.
That is to say, even if you made allocation of the percpu memory
itself scalable, you would still globally serialize on adding the
counters to and removing them from that list.
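
To make the shape of the problem concrete, here is a minimal
user-space model (illustrative only; the names are made up, the
kernel analogue being the percpu counter machinery):

#include <pthread.h>
#include <stdlib.h>

/* Illustrative stand-ins only -- not kernel code. */
struct counter {
	long *pcpu;             /* the separately allocated "per-CPU" data */
	struct counter *next;   /* linkage for the global instance list */
};

static pthread_mutex_t counters_lock = PTHREAD_MUTEX_INITIALIZER;
static struct counter *all_counters;    /* every live counter, system-wide */

static void counter_init(struct counter *c, int ncpus)
{
	c->pcpu = calloc(ncpus, sizeof(long));   /* the "percpu" allocation */

	/*
	 * Even if the allocation above scaled perfectly, every init still
	 * funnels through this one lock to register the instance, and
	 * every destroy funnels through it again to unlink.
	 */
	pthread_mutex_lock(&counters_lock);
	c->next = all_counters;
	all_counters = c;
	pthread_mutex_unlock(&counters_lock);
}

The destroy side mirrors this with an unlink under the same lock.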
But suppose this got reworked somehow and this bit ceased to be a problem.
Another spot where mm alloc/free globally serializes (at least on
x86_64) is pgd_alloc/pgd_free taking the global pgd_lock.
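
From memory, the allocation side looks roughly like this (heavily
simplified from arch/x86/mm/pgtable.c; error handling and
prepopulation details elided):

pgd_t *pgd_alloc(struct mm_struct *mm)
{
	...
	spin_lock(&pgd_lock);           /* one lock shared by every mm */
	pgd_ctor(mm, pgd);              /* links the pgd into the global pgd_list */
	pgd_prepopulate_pmd(mm, pgd, pmds);
	spin_unlock(&pgd_lock);
	...
}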
Suppose you managed to decompose that lock into finer granularity, to
the point where it no longer poses a problem from a contention
standpoint. Even then, that is work which does not have to happen
there in the first place.
The general theme is that there is a lot of expensive work happening
in the mm lifecycle (*both* from a single- and a multi-threaded
standpoint), and preferably it would be paid only once per object's
existence.
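
Which is roughly what a ctor/dtor scheme buys. A sketch of the idea
(illustrative only; the names are made up and this is not the actual
RFC implementation):

/*
 * "Pay the expensive setup once per object existence": ctor runs when
 * backing memory is first handed out, dtor only when it is reclaimed.
 */
struct cache {
	void (*ctor)(void *obj);   /* runs when backing memory is first used */
	void (*dtor)(void *obj);   /* runs when backing pages are freed */
};

/* Hypothetical helpers, declared only to keep the sketch self-contained. */
void *freelist_pop(struct cache *c);
void freelist_push(struct cache *c, void *obj);
void *grab_new_backing_object(struct cache *c);

void *cache_alloc(struct cache *c)
{
	void *obj = freelist_pop(c);

	if (!obj) {
		obj = grab_new_backing_object(c);
		c->ctor(obj);   /* percpu regions, pgd setup, etc. happen here */
	}
	return obj;             /* fast path: no percpu alloc, no global locks */
}

void cache_free(struct cache *c, void *obj)
{
	/*
	 * The expensive state is retained with the object; ->dtor only
	 * runs when the backing slab pages are reclaimed, not here.
	 */
	freelist_push(c, obj);
}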
--
Mateusz Guzik <mjguzik gmail.com>