linux-kernel - Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks

Message-ID: <f1a1b7ba-c084-4e41-9242-8255f2664b76@efficios.com>
Date: Fri, 28 Nov 2025 08:30:08 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Gabriel Krisman Bertazi <krisman@...e.de>, linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org, jack@...e.cz,
 Mateusz Guzik <mjguzik@...il.com>, Shakeel Butt <shakeel.butt@...ux.dev>,
 Michal Hocko <mhocko@...nel.org>, Dennis Zhou <dennis@...nel.org>,
 Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...two.org>,
 Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...hat.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka
 <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
 Suren Baghdasaryan <surenb@...gle.com>, Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for
 single-threaded tasks

On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote:
> The cost of the pcpu memory allocation is non-negligible for systems
> with many cpus, and it is quite visible when forking a new task, as
> reported in a few occasions.
I came to the same conclusion while developing the
hierarchical per-cpu counters.

But while the mm_struct has a slab cache (initialized in
kernel/fork.c:mm_cache_init()), there is no equivalent cache
for the per-mm per-cpu data.

In the mm_struct, we have the following per-cpu data (please
let me know if I missed any in the maze):

- struct mm_cid __percpu *pcpu_cid (or equivalent through
   struct mm_mm_cid after Thomas Gleixner gets his rewrite
   upstream),

- unsigned int __percpu *futex_ref,

- NR_MM_COUNTERS rss_stats per-cpu counters.

What would really reduce memory allocation overhead on fork
is, as a first step, to move all those fields into a single
top-level "struct mm_percpu_struct". This would merge three
per-cpu allocations into one when forking a new task.
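
To make the first step concrete, here is a rough userspace model of
what a merged structure could look like. It is only a sketch: the
field layout, struct mm_cid_model, mm_percpu_alloc() and
mm_percpu_rss_sum() are hypothetical names for illustration, the
per-cpu area is modelled as a plain array with one slot per CPU, and
rss_stat is simplified to an array of longs (the kernel uses
struct percpu_counter there).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: four RSS counters, as in current kernels. */
#define NR_MM_COUNTERS 4

/* Hypothetical stand-in for the per-cpu concurrency ID state. */
struct mm_cid_model {
	int cid;
	int recent_cid;
};

/* All per-mm per-cpu data grouped in one structure. */
struct mm_percpu_struct {
	struct mm_cid_model cid;		/* was: pcpu_cid */
	unsigned int futex_ref;			/* was: futex_ref */
	long rss_stat[NR_MM_COUNTERS];		/* was: rss_stat counters */
};

/*
 * Userspace model of a per-cpu allocation: one zeroed slot per CPU,
 * obtained with a single allocation instead of three.
 */
static struct mm_percpu_struct *mm_percpu_alloc(unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return calloc(nr_cpus, sizeof(struct mm_percpu_struct));
}

/* Sum one RSS counter across all CPU slots. */
static long mm_percpu_rss_sum(const struct mm_percpu_struct *pcpu,
			      unsigned int nr_cpus, unsigned int member)
{
	long sum = 0;
	unsigned int cpu;

	assert(member < NR_MM_COUNTERS);
	for (cpu = 0; cpu < nr_cpus; cpu++)
		sum += pcpu[cpu].rss_stat[member];
	return sum;
}
```

In this model, fork pays for one allocation sized
nr_cpus * sizeof(struct mm_percpu_struct), and teardown is a single
free.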

Then the second step is to create a mm_percpu_struct
cache to bypass the per-cpu allocator.
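
The idea behind such a cache can be sketched in userspace as a small
pool of recycled objects. This is an illustration under loud
assumptions, not a proposal for the actual kernel interface:
struct mm_pcpu_cache, mm_pcpu_cache_get(), mm_pcpu_cache_put() and
MM_PCPU_CACHE_MAX are all hypothetical, calloc() stands in for the
per-cpu allocator, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bound on how many free objects the cache retains. */
#define MM_PCPU_CACHE_MAX 8

struct mm_pcpu_cache {
	size_t obj_size;		/* bytes per cached object */
	unsigned int nr_free;		/* objects currently parked */
	void *free[MM_PCPU_CACHE_MAX];	/* stack of recycled objects */
	unsigned long slow_allocs;	/* trips to the backing allocator */
};

/* Get a zeroed object, reusing a recycled one when available. */
static void *mm_pcpu_cache_get(struct mm_pcpu_cache *c)
{
	void *obj;

	assert(c->obj_size > 0);
	if (c->nr_free > 0) {
		/* Fast path: no call into the backing allocator. */
		obj = c->free[--c->nr_free];
		memset(obj, 0, c->obj_size);
		return obj;
	}
	/* Slow path: calloc() models the expensive per-cpu allocation. */
	c->slow_allocs++;
	return calloc(1, c->obj_size);
}

/* Return an object to the cache, or free it if the cache is full. */
static void mm_pcpu_cache_put(struct mm_pcpu_cache *c, void *obj)
{
	if (!obj)
		return;
	if (c->nr_free < MM_PCPU_CACHE_MAX) {
		c->free[c->nr_free++] = obj;
		return;
	}
	free(obj);
}
```

With a pool like this, a fork/exit-heavy workload would mostly hit
the fast path, so the cost of the backing allocator is paid only when
the pool is empty.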

I suspect that by doing just that we'd get most of the
performance benefits provided by the single-threaded special-case
proposed here.

I'm not against special-casing single-threaded tasks if it's still
worth it after the data structure layout and caching changes
I'm proposing here, but I think we need to fix the memory
allocation overhead itself before working around it with
special cases and added complexity.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com