Message-ID: <CAGudoHFZhrNwM8bnkFUkad4x_ibZZqbax_psF7CX_SrFQprJbw@mail.gmail.com>
Date: Wed, 3 Dec 2025 12:02:07 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Gabriel Krisman Bertazi <krisman@...e.de>
Cc: Jan Kara <jack@...e.cz>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Shakeel Butt <shakeel.butt@...ux.dev>,
Michal Hocko <mhocko@...nel.org>, Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
Christoph Lameter <cl@...two.org>, Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka <vbabka@...e.cz>,
Mike Rapoport <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for
single-threaded tasks
On Mon, Dec 1, 2025 at 4:23 PM Gabriel Krisman Bertazi <krisman@...e.de> wrote:
>
> Mateusz Guzik <mjguzik@...il.com> writes:
> > The major claims (by me anyway) are:
> > 1. single-threaded operation for fork + exec suffers avoidable
> > overhead even without the rss counter problem, which is tractable
> > with the same kind of thing which would sort out the multi-threaded
> > problem
>
> Agreed, there are more issues in the fork/exec path than just the
> rss_stat. The rss_stat performance is particularly relevant to us,
> though, because it is a clear regression for single-threaded tasks
> introduced in 6.2.
>
> I took the time to test the slab constructor approach with the
> /sbin/true microbenchmark. I've seen only a 2% gain on that tight
> loop on the 80-core machine, which, granted, is an artificial
> benchmark, but still a good stressor of the single-threaded case.
> With this patchset, I reported a 6% improvement, getting it close to
> the performance before the pcpu rss_stats introduction. This is
> expected, as avoiding the pcpu allocation and initialization
> altogether for the single-threaded case, where it is not necessary,
> will always be better than speeding up the allocation (even though
> that is a worthwhile effort in itself, as Mathieu pointed out).
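A minimal sketch of the kind of tight /sbin/true loop described above
(the exact benchmark was not posted in the thread, so the iteration
count and error handling here are assumptions):

#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        /* fork+exec /sbin/true in a tight loop to stress
         * single-threaded process creation and teardown */
        for (int i = 0; i < 100000; i++) {
                pid_t pid = fork();

                if (pid < 0)
                        exit(1);
                if (pid == 0) {
                        execl("/sbin/true", "true", (char *)NULL);
                        _exit(1);
                }
                waitpid(pid, NULL, 0);
        }
        return 0;
}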
I'm fine with the benchmark method, but it was used on a kernel which
remains gimped by the avoidably slow walk in check_mm which I already
talked about. Per my prior commentary, it can be patched up to do the
walk only once instead of 4 times, and without taking locks.
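A sketch of what the single walk could look like, assuming the
mainline layout of mm->rss_stat as an array of struct percpu_counter;
the function name is illustrative and locking is skipped on the
assumption that nothing else can touch the mm at teardown:

static void check_mm_one_walk(struct mm_struct *mm)
{
        long sums[NR_MM_COUNTERS];
        int cpu, i;

        /* Start from the central counts... */
        for (i = 0; i < NR_MM_COUNTERS; i++)
                sums[i] = mm->rss_stat[i].count;

        /* ...and fold in the per-cpu deltas with one walk over all
         * CPUs, instead of one percpu_counter_sum() per counter. */
        for_each_possible_cpu(cpu) {
                for (i = 0; i < NR_MM_COUNTERS; i++)
                        sums[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
        }

        for (i = 0; i < NR_MM_COUNTERS; i++) {
                if (unlikely(sums[i]))
                        pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
                                 mm, resident_page_types[i], sums[i]);
        }
}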
But even a single walk is more work than none, so let's say that's
still too slow. Two ideas were proposed for avoiding the walk
altogether: I proposed expanding the tlb bitmap and Mathieu went with
the cid machinery. Either way, the walk over all CPUs goes away.
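A sketch of the bitmap variant, with mm_cpumask() standing in for the
expanded tlb bitmap (the key assumption, which is the "expanding"
part, is that any updater of the per-cpu counters also marks its CPU
in the bitmap; the cid approach would track an equivalent CPU set):

static void check_mm_bitmap_walk(struct mm_struct *mm)
{
        long sums[NR_MM_COUNTERS];
        int cpu, i;

        for (i = 0; i < NR_MM_COUNTERS; i++)
                sums[i] = mm->rss_stat[i].count;

        /* Only CPUs marked in the bitmap can hold a non-zero
         * per-cpu delta, so the walk covers just the tracked set. */
        for_each_cpu(cpu, mm_cpumask(mm)) {
                for (i = 0; i < NR_MM_COUNTERS; i++)
                        sums[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
        }

        for (i = 0; i < NR_MM_COUNTERS; i++)
                WARN_ON_ONCE(sums[i]);
}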
With the walk issue fixed and all allocations cached thanks to the
ctor/dtor pair, even single-threaded fork/exec will be faster than it
is with your patch, by virtue of *never* reaching into the per-cpu
allocator (with your patch that still happens for the cid stuff).
Additionally, there are other locks which can be elided later with the
ctor/dtor pair, further improving perf.
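To make the ctor/dtor idea concrete: mainline kmem_cache_create() only
takes a constructor, so the destructor hook below is assumed to be
(re)introduced as part of this work, and the helper names are
illustrative:

static void mm_cache_ctor(void *obj)
{
        struct mm_struct *mm = obj;

        /* Runs when the slab object is first created, not on every
         * fork(), so the per-cpu allocator is visited only once per
         * cached object. A void ctor cannot report failure; handling
         * that is one of the open problems with this approach. */
        percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL,
                                 NR_MM_COUNTERS);
}

/* Hypothetical destructor: runs when the slab object itself is
 * destroyed, not on every mmdrop(). */
static void mm_cache_dtor(void *obj)
{
        struct mm_struct *mm = obj;

        percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
}

A fresh mm then only needs the counters to read zero, which teardown
already verifies, instead of allocating them.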
>
> > 2. unfortunately there is an increasing number of multi-threaded (and
> > often short-lived) processes (example: lld, the linker from the llvm
> > project; more broadly plenty of things in Rust where people think
> > threading == performance)
>
> I don't agree with this argument, though. Sure, there is an increasing
> number of multi-threaded applications, but this is not relevant. The
> relevant argument is the amount of single-threaded workloads. One
> example is coreutils, which are spawned to death by scripts. I took
> the care of testing the patchset with a full distro on my day-to-day
> laptop and wasn't surprised to see that the vast majority of forked
> tasks never spawn a second thread. The ones that do are most often
> long-lived applications, where the cost of mm initialization is way
> less relevant to the overall system performance. Another example is
> the fact that real-world benchmarks, like kernbench, can be improved
> by special-casing single-threaded tasks.
>
I stress one more time that a full fixup for the situation as
described above not only gets rid of the problem for *both* single-
and multi-threaded operation, but ends up with code which is faster
than your patchset even for the case you are patching for.
The multi-threaded stuff *is* very much relevant because it is
increasingly common (see below). I did not claim that single-threaded
workloads don't matter.
I would not be arguing here if there was no feasible way to handle
both or if handling the multi-threaded case still resulted in
measurable overhead for single-threaded workloads.
Since you mention configure scripts, I'm intimately familiar with
large-scale building as a workload. While it is true that there is
rampant usage of shell, sed and whatnot (all of which are
single-threaded), things turn multi-threaded (and short-lived) very
quickly once you go past the gnu toolchain and/or c/c++ codebases.
For example, the llvm linker is multi-threaded and short-lived. Since
most real programs are small, during a large-scale build of different
programs you end up with tons of lld instances spawning and quitting
all the time.
Beyond that, Java, Erlang, Zig and others like to multi-thread as
well. Rust is an emerging ecosystem where people think adding
threading automatically equals better performance and where crate
authors think it's fine to sneak in threads (my favourite offender is
the ctrlc crate). And since Rust is growing in popularity, you can
expect the kind of single-threaded tooling you see right now to turn
multi-threaded out from under you over time.
> > The pragmatic way forward (as I see it anyway) is to fix up the
> > multi-threaded thing and see if trying to special-case the
> > single-threaded case is justifiable afterwards.
> >
> > Given that the current patchset has to resort to atomics in certain
> > cases, there is some error-proneness and runtime overhead associated
> > with it going beyond merely checking if the process is
> > single-threaded, which puts an additional question mark on it.
>
> I don't get why atomics would make it error-prone. But, regarding the
> runtime overhead, please note that the main point of this approach is
> that the hot path can be handled with a simple non-atomic variable
> write in the task context, not an atomic operation. The latter is only
> used for the infrequent cases where the counter is touched by an
> external task such as OOM, khugepaged, etc.
>
The claim is that there may be a bug where something should be using
the atomic codepath but is not.
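To make both the mechanism and the concern concrete, a sketch with
illustrative names (mm_is_single_threaded(), rss_plain and rss_atomic
are stand-ins, not identifiers from the patchset):

static inline void add_mm_counter_sketch(struct mm_struct *mm,
                                         int member, long value)
{
        if (current->mm == mm && mm_is_single_threaded(mm)) {
                /* Hot path: only this task touches the counter,
                 * so a plain non-atomic write suffices. */
                mm->rss_plain[member] += value;
        } else {
                /* External updater (OOM killer, khugepaged, ...)
                 * must take the atomic path. */
                atomic_long_add(value, &mm->rss_atomic[member]);
        }
}

The feared bug is a call site taking the plain-write branch (or
skipping the check entirely) while a concurrent external updater is
live, silently corrupting the counter.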