Message-ID: <n6lvd6chmhpc2cdzjwwvetwvpkqxc5ajqdzhcuzc2fajveo5qv@3u4r4y3sa2qx>
Date: Wed, 26 Mar 2025 22:54:23 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@...miny.me>, 
	Andrew Morton <akpm@...ux-foundation.org>, Steven Rostedt <rostedt@...dmis.org>, 
	Masami Hiramatsu <mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, 
	Christoph Lameter <cl@...ux.com>, Martin Liu <liumartin@...gle.com>, 
	David Rientjes <rientjes@...gle.com>, Jani Nikula <jani.nikula@...el.com>, 
	Sweet Tea Dorminy <sweettea@...gle.com>, Johannes Weiner <hannes@...xchg.org>, 
	Christian Brauner <brauner@...nel.org>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	Suren Baghdasaryan <surenb@...gle.com>, "Liam R . Howlett" <Liam.Howlett@...cle.com>, 
	Wei Yang <richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>, 
	Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org, 
	Matthew Wilcox <willy@...radead.org>, paulmck <paulmck@...nel.org>, Yu Zhao <yuzhao@...gle.com>, 
	Roman Gushchin <roman.gushchin@...ux.dev>, Greg Thelen <gthelen@...gle.com>, shakeel.butt@...ux.dev
Subject: Re: [PATCH] mm: use per-numa-node atomics instead of percpu_counters

On Wed, Mar 26, 2025 at 03:56:15PM -0400, Mathieu Desnoyers wrote:
> On 2025-03-25 18:15, Sweet Tea Dorminy wrote:
> > From: Sweet Tea Dorminy <sweettea@...gle.com>
> > 
> > Recently, several internal services had an RSS usage regression as part of a
> > kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> > read RSS statistics in a backup watchdog process to monitor and decide if
> > they'd overrun their memory budget. Now, however, a representative service
> > with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> > machine had memory usage tens of megabytes different from the expected amount
> > -- this constituted a significant percentage of inaccuracy, causing the
> > watchdog to act.
> > 
> 
> I suspect the culprit sits here:
> 
> int percpu_counter_batch __read_mostly = 32;
> EXPORT_SYMBOL(percpu_counter_batch);
> 
> static int compute_batch_value(unsigned int cpu)
> {
>         int nr = num_online_cpus();
> 
>         percpu_counter_batch = max(32, nr*2);
>         return 0;
> }
> 
> So correct me if I'm wrong, but in this case the worst-case
> inaccuracy for a 256 cpu machine would be
> "+/- percpu_counter_batch" within each percpu counter,
> thus globally:
> 
> +/- (256 * 2) * 256, or 131072 pages, meaning an inaccuracy
> of +/- 512MB with 4kB pages. This is quite significant.
> 
> So I understand that the batch size is scaled up as the
> number of CPUs increases to minimize contention on the
> percpu_counter lock. Unfortunately, as the number of CPUs
> increases, the inaccuracy increases with the square of the
> number of cpus.
> 
> Have you tried decreasing this percpu_counter_batch value on
> larger machines to see if it helps ?
> 

per-cpu rss counters replaced a per-thread variant, which for
sufficiently threaded processes had a significantly bigger error.

See f1a7941243c102a4 ("mm: convert mm's rss stats into percpu_counter").

The use in rss aside, the current implementation of per-cpu counters has
to meet two seemingly conflicting requirements: on one hand,
synchronisation with other CPUs needs to be rare to maintain
scalability; on the other, the more CPUs there are to worry about,
the bigger the error vs the central value and the more often you should
synchronize it.

So I think something needs to be done about the mechanism in general.

While I don't have a fully thought-out idea, offhand I suspect turning
these into hierarchical state should help solve it?

As in instead of *one* central value everyone writes to in order to
offload their batch, there could be a level or two of intermediary
values -- think of a tree you go up as needed.

Then for example the per-cpu batch could be much smaller as the penalty
for rolling it up to one level higher would be significantly lower than
going after the main counter.

I have no time to work on something like this though. Maybe someone has
a better idea.
