Message-ID: <CAGudoHEyX1gdwG_MVf-M2KMHBE1Juo6VbfSyp3rbXoS+5JaNtw@mail.gmail.com>
Date: Sat, 29 Nov 2025 06:57:21 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Gabriel Krisman Bertazi <krisman@...e.de>, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Shakeel Butt <shakeel.butt@...ux.dev>, Michal Hocko <mhocko@...nel.org>,
Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...two.org>,
Andrew Morton <akpm@...ux-foundation.org>, David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for
single-threaded tasks
On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@...e.cz> wrote:
> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> > What would really reduce memory allocation overhead on fork
> > is to move all those fields into a top level
> > "struct mm_percpu_struct" as a first step. This would
> > merge 3 per-cpu allocations into one when forking a new
> > task.
> >
> > Then the second step is to create a mm_percpu_struct
> > cache to bypass the per-cpu allocator.
> >
> > I suspect that by doing just that we'd get most of the
> > performance benefits provided by the single-threaded special-case
> > proposed here.
>
> I don't think so. Because in the profiles I have been doing for these
> loads the biggest cost wasn't actually the per-cpu allocation itself but
> the cost of zeroing the allocated counter for many CPUs (and then the
> counter summarization on exit) and you're not going to get rid of that with
> just reshuffling per-cpu fields and adding slab allocator in front.
>
The entire ordeal has been discussed several times already. I'm rather
disappointed there is a new patchset posted which does not address any
of it and goes straight to special-casing single-threaded operation.
The major claims (by me anyway) are:
1. single-threaded operation for fork + exec suffers avoidable
overhead even without the rss counter problem, and that overhead is
tractable with the same kind of approach that would sort out the
multi-threaded problem
2. unfortunately there is an increasing number of multi-threaded (and
often short-lived) processes (example: lld, the linker from the LLVM
project; more broadly, plenty of things written in Rust, where people
think threading == performance)
Bottom line is, solutions like the one proposed in the patchset are at
best a stopgap, and even then they leave performance on the table for the
case they are optimizing for.
The pragmatic way forward (as I see it anyway) is to fix up the
multi-threaded case first and then see whether special-casing the
single-threaded case is still justifiable.
Given that the current patchset has to resort to atomics in certain
cases, there is some error-proneness and runtime overhead associated
with it beyond merely checking whether the process is
single-threaded, which puts an additional question mark on it.
Now to business:
You mentioned the rss loops are a problem. I agree, but they can be
largely damage-controlled. More importantly there are 2 loops of the
sort already happening even with the patchset at hand.
mm_alloc_cid() results in one loop in the percpu allocator to zero out
the area, then mm_init_cid() performs the following:
for_each_possible_cpu(i) {
        struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);

        pcpu_cid->cid = MM_CID_UNSET;
        pcpu_cid->recent_cid = MM_CID_UNSET;
        pcpu_cid->time = 0;
}
There is no way this is not already visible on a 256-thread machine.
Preferably some magic would be done to init this on first use on a given
CPU. There is some bitmap tracking CPU presence, maybe this can be
tackled on top of it. But for the sake of argument let's say that's
too expensive or perhaps not feasible. Even then, the walk can be done
*once* by telling the percpu allocator to refrain from zeroing the memory.
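To illustrate the first-use idea, something along these lines could work
(a sketch only; mm_cid_initialized_mask() and the exact hook point are
made up, the rest is existing API):

static inline struct mm_cid *mm_cid_get_initialized(struct mm_struct *mm)
{
        int cpu = raw_smp_processor_id();
        struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu);

        /*
         * Lazily initialize the per-CPU cid state the first time this
         * CPU touches this mm, instead of walking all possible CPUs at
         * fork time.  The tracking bitmap is hypothetical.
         */
        if (!cpumask_test_cpu(cpu, mm_cid_initialized_mask(mm))) {
                pcpu_cid->cid = MM_CID_UNSET;
                pcpu_cid->recent_cid = MM_CID_UNSET;
                pcpu_cid->time = 0;
                cpumask_set_cpu(cpu, mm_cid_initialized_mask(mm));
        }
        return pcpu_cid;
}

The bitmap itself still has to be cleared at fork, but that's a cpumask,
not a multi-kilobyte percpu area.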
Which brings me to rss counters. In the current kernel that's
*another* loop over everything to zero it out. But it does not have to
be that way. Suppose the bitmap shenanigans mentioned above are a no-go
for these as well.
So instead the code could reach out to the percpu allocator to
allocate memory for both cid and rss (as mentioned by Mathieu), but
have it returned uninitialized, then loop over it once, sorting out both
cid and rss in the same body. This should be drastically faster than
the current code.
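Something like this (a sketch; the combined struct, the mm->pcpu_state
field and alloc_percpu_noinit() are made up for illustration, and the rss
side glosses over the fact that rss_stat is a percpu_counter today):

struct mm_pcpu_state {
        struct mm_cid   cid;
        long            rss[NR_MM_COUNTERS];
};

static int mm_alloc_pcpu_state(struct mm_struct *mm)
{
        struct mm_pcpu_state *p;
        int cpu, i;

        /* Hypothetical percpu alloc variant which skips the zeroing pass. */
        mm->pcpu_state = alloc_percpu_noinit(struct mm_pcpu_state);
        if (!mm->pcpu_state)
                return -ENOMEM;

        /* One walk over all possible CPUs instead of three. */
        for_each_possible_cpu(cpu) {
                p = per_cpu_ptr(mm->pcpu_state, cpu);
                p->cid.cid = MM_CID_UNSET;
                p->cid.recent_cid = MM_CID_UNSET;
                p->cid.time = 0;
                for (i = 0; i < NR_MM_COUNTERS; i++)
                        p->rss[i] = 0;
        }
        return 0;
}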
But one may observe it is an invariant that the values sum up to 0 on process exit.
So if one was to make sure the first time this is handed out by the
percpu allocator the values are all 0s and then cache the area
somewhere for future allocs/frees of mm, there would be no need to do
the zeroing on alloc.
On the free side summing up rss counters in check_mm() is only there
for debugging purposes. Suppose it is useful enough that it needs to
stay. Even then, as implemented right now, this is just slow for no
reason:
for (i = 0; i < NR_MM_COUNTERS; i++) {
        long x = percpu_counter_sum(&mm->rss_stat[i]);
        [snip]
}
That's *four* loops with extra overhead of irq-trips for every single
one. This can be patched up to only do one loop, possibly even with
irqs enabled the entire time.
Doing the loop is still slower than not doing it, but this may be just
fast enough to obsolete ideas like the one in the proposed patchset.
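For illustration, the debug check could do a single pass along these
lines (a sketch, assuming CONFIG_SMP and reaching into percpu_counter
internals, which check_mm() can get away with because nothing updates
the counters at that point):

static void check_mm_rss(struct mm_struct *mm)
{
        long sum[NR_MM_COUNTERS];
        int cpu, i;

        /* Start from the central counts... */
        for (i = 0; i < NR_MM_COUNTERS; i++)
                sum[i] = percpu_counter_read(&mm->rss_stat[i]);

        /* ...and fold in all per-CPU deltas in one walk, no irq games. */
        for_each_possible_cpu(cpu) {
                for (i = 0; i < NR_MM_COUNTERS; i++)
                        sum[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
        }

        for (i = 0; i < NR_MM_COUNTERS; i++) {
                if (unlikely(sum[i]))
                        pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
                                 mm, resident_page_types[i], sum[i]);
        }
}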
While per-cpu level caching for all possible allocations seems like
the easiest way out, it in fact does *NOT* fully solve the problem -- you
are still going to globally serialize in lru_gen_add_mm() (and the del
part), pgd_alloc() and other places.
Or to put it differently, per-cpu caching of mm_struct itself makes no
sense in the current kernel (with the patchset or not) because on the
way to finish the alloc or free you are going to globally serialize
several times and *that* is the issue to fix in the long run. You can
make the problematic locks fine-grained (and consequently alleviate
the scalability aspect), but you are still going to suffer the
overhead of taking them.
As far as I'm concerned the real long-term solution(tm) would make the
cached mm's retain the expensive-to-sort-out state -- list presence,
percpu memory and whatever else.
To that end I see 2 feasible approaches:
1. a dedicated allocator with coarse granularity
Instead of per-CPU, you could have an instance for every n threads
(let's say 8 or whatever). This would pose a tradeoff between total
memory usage and scalability outside of a microbenchmark setting. You
are still going to serialize in some cases, but only once on alloc and
once on free, not several times, and the single-threaded case remains
cheaper. This is faster all around; see the sketch below.
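As a sketch of what I mean (all names made up, sizes arbitrary):

#define MM_CACHE_CPUS_PER_GROUP 8
#define MM_CACHE_BATCH          16

struct mm_cache_group {
        spinlock_t              lock;
        unsigned int            nr;
        struct mm_struct        *mms[MM_CACHE_BATCH];
} ____cacheline_aligned_in_smp;

static struct mm_cache_group *mm_cache_groups;  /* nr_cpu_ids / 8 of these */

static struct mm_struct *mm_cache_get(void)
{
        struct mm_cache_group *g;
        struct mm_struct *mm = NULL;

        g = &mm_cache_groups[raw_smp_processor_id() / MM_CACHE_CPUS_PER_GROUP];

        /*
         * One lock acquisition shared by up to 8 CPUs, instead of several
         * global ones.  A cached mm keeps its expensive state (percpu
         * memory, list linkage), so a hit skips all of that setup.
         */
        spin_lock(&g->lock);
        if (g->nr)
                mm = g->mms[--g->nr];
        spin_unlock(&g->lock);
        return mm;
}

The put side is symmetric, falling back to the real free path once the
group is full.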
2. dtor support in the slub allocator
ctor does the hard work and dtor undoes it. There is an unfinished
patchset by Harry which implements the idea[1].
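The rough shape (not the exact API from that patchset; the dtor-capable
cache creation helper is hypothetical, and ctor failure handling is one
of the open problems glossed over here):

static void mm_ctor(void *obj)
{
        struct mm_struct *mm = obj;

        /* Pay for the expensive, reusable state once per object. */
        mm->pcpu_cid = alloc_percpu(struct mm_cid);
        /* ... other costly setup ... */
}

static void mm_dtor(void *obj)
{
        struct mm_struct *mm = obj;

        /* Only runs when the slab page itself is torn down. */
        free_percpu(mm->pcpu_cid);
}

mm_cachep = kmem_cache_create_dtor("mm_struct", sizeof(struct mm_struct),
                                   0, SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
                                   mm_ctor, mm_dtor);

In steady state, kmem_cache_alloc()/kmem_cache_free() of an mm then
touch neither the ctor nor the dtor.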
There is a serious concern about deadlock potential stemming from
running arbitrary dtor code during memory reclaim. I already described
elsewhere how with a little bit of discipline supported by lockdep
this is a non-issue (tl;dr add spinlocks marked as "leaf" (you can't
take any locks if you hold them and you have to disable interrupts) +
mark dtors as only allowed to hold a leaf spinlock et voila, code
guaranteed to not deadlock). But then all code trying to cache state
that is to be undone by the dtor has to be patched to facilitate it.
Again, bugs in the area would be sorted out by lockdep.
The good news is that folks were apparently open to punting reclaim of
such memory into a workqueue, which completely alleviates that concern
anyway.
As it happens, when fork + exit is involved there are numerous other
bottlenecks which overshadow the above, but that's a rant for another
day. Here we can pretend for a minute they are solved.
[1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads