Message-ID: <CAGudoHGYC=TdLfzQ0wW5NOrZh1Xy9aSHEYiMLqv_ekX5f52M7g@mail.gmail.com>
Date: Wed, 3 Dec 2025 12:54:34 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Gabriel Krisman Bertazi <krisman@...e.de>
Cc: Jan Kara <jack@...e.cz>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Shakeel Butt <shakeel.butt@...ux.dev>, 
	Michal Hocko <mhocko@...nel.org>, Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, 
	Christoph Lameter <cl@...two.org>, Andrew Morton <akpm@...ux-foundation.org>, 
	David Hildenbrand <david@...hat.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	"Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka <vbabka@...e.cz>, 
	Mike Rapoport <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>, 
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for
 single-threaded tasks

On Wed, Dec 3, 2025 at 12:02 PM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> On Mon, Dec 1, 2025 at 4:23 PM Gabriel Krisman Bertazi <krisman@...e.de> wrote:
> >
> > Mateusz Guzik <mjguzik@...il.com> writes:
> > > The major claims (by me anyway) are:
> > > 1. single-threaded operation for fork + exec suffers avoidable
> > > overhead even without the rss counter problem, which is tractable
> > > with the same kind of approach that would sort out the
> > > multi-threaded problem
> >
> > Agreed, there are more issues in the fork/exec path than just the
> > rss_stat.  The rss_stat performance is particularly relevant to us,
> > though, because it is a clear regression for the single-threaded
> > case, introduced in 6.2.
> >
> > I took the time to test the slab constructor approach with the
> > /sbin/true microbenchmark.  I saw only a 2% gain on that tight loop on
> > the 80-core machine, which, granted, is an artificial benchmark, but
> > still a good stressor of the single-threaded case.  With this patchset,
> > I reported a 6% improvement, getting it close to the performance before
> > the pcpu rss_stat introduction.  This is expected, as avoiding the pcpu
> > allocation and initialization altogether for the single-threaded case,
> > where it is not necessary, will always be better than speeding up the
> > allocation (even though that is a worthwhile effort in itself, as
> > Mathieu pointed out).
>
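
For anyone following along, the /sbin/true microbenchmark mentioned
above is essentially a tight fork+exec+wait loop, something like this
(a sketch, not necessarily the exact harness used):

#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* stress single-threaded fork/exec: one mm alloc + teardown per pass */
	for (int i = 0; i < 100000; i++) {
		pid_t pid = fork();
		if (pid == 0) {
			execl("/sbin/true", "true", (char *)NULL);
			_exit(1);	/* only reached if exec failed */
		}
		waitpid(pid, NULL, 0);
	}
	return 0;
}
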
> I'm fine with the benchmark method, but it was used on a kernel which
> remains gimped by the avoidably slow walk in check_mm, which I already
> talked about.
>
> Per my prior commentary, it can be patched up to only do the walk once
> instead of 4 times, and without taking locks.
>
> But that's still more work than nothing, so let's say that's still too
> slow. Two ideas were proposed for avoiding the walk altogether: I
> proposed expanding the tlb bitmap and Mathieu went with the cid
> machinery. Either way the walk over all CPUs goes away.
>
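
For reference, that check_mm() fixup boils down to one pass over the
CPUs summing all NR_MM_COUNTERS counters at once, with no locking,
since the mm has no users left at that point. A sketch of the shape
(not the in-tree code):

static void check_mm_one_pass(struct mm_struct *mm)
{
	long sum[NR_MM_COUNTERS] = { 0 };
	int cpu, i;

	/* one pass: one set of cache misses per CPU, not per counter */
	for_each_possible_cpu(cpu) {
		for (i = 0; i < NR_MM_COUNTERS; i++)
			sum[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
	}

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		sum[i] += mm->rss_stat[i].count;	/* the central count */
		if (unlikely(sum[i]))
			pr_alert("BUG: Bad rss-counter state mm:%p type:%d val:%ld\n",
				 mm, i, sum[i]);
	}
}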

So I got another idea and it boils down to coalescing cid init with
rss checks on exit.

I repeat that with your patchset the single-threaded case is left with
one walk on alloc (for the cid stuff), and that's where issues arise on
machines with tons of CPUs.

If the walk gets fixed, the same method can be used to avoid the walk
for rss, obsoleting the patchset.

So let's say it is unfixable for the time being.

mm_init_cid stores a bunch of -1 per-cpu. I'm assuming this can't be changed.

One can still handle allocation in the ctor/dtor and make it an
invariant that the state present is ready to use, meaning in particular
that mm_init_cid was already issued on it.
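
A minimal sketch of that invariant, assuming a ctor/dtor pair on the
mm_struct cache (the dtor side is hypothetical, slab has no destructors
in-tree today, the mm_init_cid_state() helper is made up to stand for
the per-cpu -1 stores, and this glosses over a slab ctor not being able
to fail):

static void mm_struct_ctor(void *obj)
{
	struct mm_struct *mm = obj;

	/*
	 * Pay for the per-cpu allocation and initialization once, when
	 * the slab object is created, instead of on every fork.
	 * Invariant: a cached mm always has cid state ready to use.
	 */
	mm->pcpu_cid = alloc_percpu(struct mm_cid);
	mm_init_cid_state(mm);	/* hypothetical: just the -1 stores */
}

static void mm_struct_dtor(void *obj)
{
	struct mm_struct *mm = obj;

	free_percpu(mm->pcpu_cid);
}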

Then it is up to the exit side to clean it up, and this is where the
walk checks rss state *and* reinits the cid state in one loop.

Excluding the repeated lock and irq trips, which don't need to be
there, I take it almost all of the overhead is cache misses. With one
loop that's sorted out.
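
Roughly like this (a sketch of the shape; MM_CID_UNSET is the -1 that
mm_init_cid stores, field names per the current mm_struct layout):

static void mm_exit_combined_walk(struct mm_struct *mm)
{
	long sum[NR_MM_COUNTERS] = { 0 };
	int cpu, i;

	/* no locks or irq trips: the mm has no users left at this point */
	for_each_possible_cpu(cpu) {
		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu);

		/* the rss check and the cid reinit ride the same walk */
		for (i = 0; i < NR_MM_COUNTERS; i++)
			sum[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
		pcpu_cid->cid = MM_CID_UNSET;
		pcpu_cid->recent_cid = MM_CID_UNSET;
	}

	for (i = 0; i < NR_MM_COUNTERS; i++)
		if (unlikely(sum[i]))
			pr_alert("BUG: Bad rss-counter state mm:%p type:%d val:%ld\n",
				 mm, i, sum[i]);
}

With that, the object goes back to the cache already satisfying the
mm_init_cid invariant from above.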

Maybe I'll hack it up myself, but perhaps Mathieu or Harry would be
happy to do it? (or have a better idea?)

> With the walk issue fixed and all allocations cached thanks to the
> ctor/dtor, even single-threaded fork/exec will be faster than it is
> with your patch thanks to *never* reaching into the per-cpu allocator
> (with your patch that still happens for the cid stuff).
>
> Additionally there are other locks which can be elided later with the
> ctor/dtor pair, further improving perf.
>
> >
> > > 2. unfortunately there is an increasing number of multi-threaded
> > > (and often short-lived) processes (example: lld, the linker from the
> > > llvm project; more broadly, plenty of things in Rust, where people
> > > think threading == performance)
> >
> > I don't agree with this argument, though.  Sure, there is an
> > increasing number of multi-threaded applications, but this is not
> > relevant.  The relevant argument is the number of single-threaded
> > workloads.  One example is coreutils, which are spawned to death by
> > scripts.  I did take care to test the patchset with a full distro on
> > my day-to-day laptop, and I wasn't surprised to see that the vast
> > majority of forked tasks never spawn a second thread.  The ones that
> > do are most often long-lived applications, where the cost of mm
> > initialization is way less relevant to the overall system performance.
> > Another example is the fact that real-world benchmarks, like
> > kernbench, can be improved by special-casing single-threaded tasks.
> >
>
> I stress one more time that a full fixup for the situation as I
> described above not only gets rid of the problem for *both* single-
> and multi-threaded operation, but ends up with code which is faster
> than your patchset even for the case you are patching for.
>
> The multi-threaded stuff *is* very much relevant because it is
> increasingly common (see below). I did not claim that single-threaded
> workloads don't matter.
>
> I would not be arguing here if there were no feasible way to handle
> both, or if handling the multi-threaded case still resulted in
> measurable overhead for single-threaded workloads.
>
> Since you mention configure scripts, I'm intimately familiar with
> large-scale building as a workload. While it is true that there is
> rampant usage of shell, sed and whatnot (all of which are
> single-threaded), things turn multi-threaded (and short-lived) very
> quickly once you go past the gnu toolchain and/or c/c++ codebases.
>
> For example, the llvm linker is multi-threaded and short-lived. Since
> most real programs are small, during a large-scale build of different
> programs you end up with tons of lld instances spawning and quitting
> all the time.
>
> Beyond that, java, erlang, zig and others like to multi-thread as well.
>
> Rust is an emerging ecosystem where people think adding threading
> automatically equals better performance and where crate authors think
> it's fine to sneak in threads (my favourite offender is the ctrlc
> crate). And since Rust is growing in popularity, you can expect that
> the kind of single-threaded tooling you see right now will turn
> multi-threaded from under you over time.
>
> > > The pragmatic way forward (as I see it anyway) is to fix up the
> > > multi-threaded thing and see if trying to special case for
> > > single-threaded case is justifiable afterwards.
> > >
> > > Given that the current patchset has to resort to atomics in certain
> > > cases, there is some error-proneness and runtime overhead associated
> > > with it going beyond merely checking if the process is
> > > single-threaded, which puts an additional question mark on it.
> >
> > I don't get why atomics would make it error-prone.  But, regarding
> > the runtime overhead, please note that the main point of this approach
> > is that the hot path can be handled with a simple non-atomic variable
> > write in the task context, not the atomic operation.  The latter is
> > only used for the infrequent case where the counter is touched by an
> > external task such as OOM, khugepaged, etc.
> >
>
> The claim is that there may be a bug where something should be using
> the atomic codepath but is not.
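
To spell out the split being discussed (an illustrative shape, not the
patchset's actual code): the owner does plain writes, everyone else
goes through an atomic on a separate field, and readers sum the two.

struct rss_counter_sketch {
	long local;		/* written only by the owning task */
	atomic_long_t remote;	/* written by external tasks (OOM, khugepaged, ...) */
};

static inline void rss_add_owner(struct rss_counter_sketch *c, long v)
{
	/* hot path: plain load/store, no atomics, no fences */
	WRITE_ONCE(c->local, READ_ONCE(c->local) + v);
}

static inline void rss_add_remote(struct rss_counter_sketch *c, long v)
{
	/* slow path for updates from outside the task context */
	atomic_long_add(v, &c->remote);
}

static inline long rss_read(struct rss_counter_sketch *c)
{
	return READ_ONCE(c->local) + atomic_long_read(&c->remote);
}

The error-proneness concern then amounts to: nothing stops a caller
outside the task context from mistakenly using the non-atomic helper.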
