Message-ID: <2m3wwqpha2jlo4zjn6xbucahfufej75gbaxxgh4j4h67pgrw7b@diodkog7ygk3>
Date: Wed, 2 Apr 2025 17:00:34 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Sweet Tea Dorminy <sweettea-kernel@...miny.me>
Cc: Andrew Morton <akpm@...ux-foundation.org>, 
	Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>, 
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>, 
	Christoph Lameter <cl@...ux.com>, Martin Liu <liumartin@...gle.com>, 
	David Rientjes <rientjes@...gle.com>, Christian König <christian.koenig@....com>, 
	Johannes Weiner <hannes@...xchg.org>, Sweet Tea Dorminy <sweettea@...gle.com>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, "Liam R . Howlett" <Liam.Howlett@...cle.com>, 
	Suren Baghdasaryan <surenb@...gle.com>, Vlastimil Babka <vbabka@...e.cz>, 
	Christian Brauner <brauner@...nel.org>, Wei Yang <richard.weiyang@...il.com>, 
	David Hildenbrand <david@...hat.com>, Miaohe Lin <linmiaohe@...wei.com>, 
	Al Viro <viro@...iv.linux.org.uk>, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	linux-trace-kernel@...r.kernel.org, Yu Zhao <yuzhao@...gle.com>, 
	Roman Gushchin <roman.gushchin@...ux.dev>, Mateusz Guzik <mjguzik@...il.com>
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of
 percpu_counters

On Mon, Mar 31, 2025 at 06:35:14PM -0400, Sweet Tea Dorminy wrote:
> [Resend as requested as RFC and minus prereq-patch-id junk]
> 
> Recently, several internal services had an RSS usage regression as part of a
> kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
> read RSS statistics in a backup watchdog process to monitor and decide if
> they'd overrun their memory budget.

Is there a reason these applications are not using memcg stats/usage instead
of RSS? RSS is not the only memory consumption for these applications.

> Now, however, a representative service
> with five threads, expected to use about a hundred MB of memory, on a 250-cpu
> machine had memory usage tens of megabytes different from the expected amount
> -- this constituted a significant percentage of inaccuracy, causing the
> watchdog to act.

Do these 5 threads jump all over the 250 CPUs?

> 
> This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") [1].  Previously, the memory error was bounded by
> 64*nr_threads pages, a very livable megabyte. Now, however, as a result of
> scheduler decisions moving the threads around the CPUs, the memory error could
> be as large as a gigabyte.

Applications with tens of thousands of threads are very normal at Google.
So, the inaccuracy should be comparable for such applications.
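
To put rough numbers on that comparison, here is a back-of-the-envelope sketch. It assumes 4 KiB pages and assumes the default percpu_counter batch is max(32, 2 * nr_cpus), which is how I recall lib/percpu_counter.c computing it; the function names are made up for illustration and the bounds are per counter, not for total RSS.

```python
PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages


def per_thread_bound_pages(nr_threads, per_thread_batch=64):
    """Pre-6.2 scheme: error bounded by 64 pages per thread."""
    return per_thread_batch * nr_threads


def percpu_batch(nr_cpus):
    """Assumed default percpu_counter batch: max(32, 2 * nr_cpus)."""
    return max(32, 2 * nr_cpus)


def percpu_bound_pages(nr_cpus):
    """Each CPU may hold up to about one batch of unflushed pages."""
    return percpu_batch(nr_cpus) * nr_cpus


def to_mib(pages):
    """Convert a page count to MiB."""
    return pages * PAGE_SIZE / (1024 * 1024)


if __name__ == "__main__":
    for threads in (5, 10_000):
        mib = to_mib(per_thread_bound_pages(threads))
        print(f"per-thread, {threads} threads: {mib:.2f} MiB")
    mib = to_mib(percpu_bound_pages(250))
    print(f"percpu, 250 cpus: {mib:.2f} MiB per counter")
```

Under these assumptions, 5 threads give 1.25 MiB with the old scheme and 250 CPUs give about 488 MiB per counter with percpu_counter, while 10,000 threads under the old scheme already give 2500 MiB.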

> 
> This is a really tremendous inaccuracy for any few-threaded program on a
> large machine and impedes monitoring significantly. These stat counters are
> also used to make OOM killing decisions, so this additional inaccuracy could
> make a big difference in OOM situations -- either resulting in the wrong
> process being killed, or in less memory being returned from an OOM-kill than
> expected.
> 
> Finally, while the change to percpu_counter does significantly improve the
> accuracy over the previous per-thread error for many-threaded services, it does
> also have performance implications - up to 12% slower for short-lived processes
> and 9% increased system time in make test workloads [2].
> 
> A previous attempt to address this regression by Peng Zhang [3] used a hybrid
> approach with delayed allocation of percpu memory for rss_stats, showing
> promising improvements of 2-4% for process operations and 6.7% for page
> faults.
> 
> This RFC takes a different direction by replacing percpu_counters with a
> more efficient set of per-NUMA-node atomics. The approach:
> 
> - Uses one atomic per node up to a bound to reduce cross-node updates.
> - Keeps a similar batching mechanism, with a smaller batch size.
> - Eliminates the use of a spin lock during batch updates, bounding stat
>   update latency.
> - Reduces percpu memory usage and thus thread startup time.

That one atomic per node will easily become a bottleneck for
applications with a lot of threads, particularly on systems where
there are a lot of CPUs per NUMA node.

> 
> Most importantly, this bounds the total error to 32 times the number of NUMA
> nodes, significantly smaller than previous error bounds.
> 
> On a 112-core machine, lmbench showed comparable results before and after this
> patch.  However, on a 224 core machine, performance improvements were

How many CPUs per node does each of these machines have?

> significant over percpu_counter:
> - Pagefault latency improved by 8.91%

The fork improvements below are understandable since percpu counter
allocation is involved, but the page fault latency improvement above needs
some explanation.

> - Process fork latency improved by 6.27%
> - Process fork/execve latency improved by 6.06%
> - Process fork/exit latency improved by 6.58%
> 
> will-it-scale also showed significant improvements on these machines.

Are these the process variants or the thread variants?

> 
> [1] https://lore.kernel.org/all/20221024052841.3291983-1-shakeelb@google.com/
> [2] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3/
> [3] https://lore.kernel.org/all/20240418142008.2775308-1-zhangpeng362@huawei.com/
> 
> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@...miny.me>
> Cc: Yu Zhao <yuzhao@...gle.com>
> Cc: Roman Gushchin <roman.gushchin@...ux.dev>
> Cc: Shakeel Butt <shakeel.butt@...ux.dev>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> Cc: Mateusz Guzik <mjguzik@...il.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> 
> ---
> 
> This is mostly a resend of an earlier patch, where I made an utter hash
> of specifying a base commit (and forgot to update my commit text to not
> call it an RFC, and forgot to update my email to the one I use for
> upstream work...). This is based on akpm/mm-unstable as of today.
> 
> v1 can be found at
> https://lore.kernel.org/lkml/20250325221550.396212-1-sweettea-kernel@dorminy.me/
> 
> Some interesting ideas came out of that discussion: Mathieu Desnoyers
> has a design doc for an improved percpu counter, multi-level, with
> constant drift, at 
> https://lore.kernel.org/lkml/a89cb4d9-088e-4ed6-afde-f1b097de8db9@efficios.com/
> and would like performance comparisons against just reducing the batch
> size in the existing code;

You can do the experiments with different batch sizes in the existing
code without waiting for Mathieu's multi-level percpu counter.
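
As a rough illustration of what such an experiment trades off, here is a userspace model of a batched per-CPU counter, loosely following the fold-into-shared-count-when-the-local-delta-reaches-the-batch logic of percpu_counter. It is only a sketch: the class, the function names, and the random workload are all made up, and it says nothing about real cache or lock contention, which needs measuring on actual hardware.

```python
import random


class BatchedCounter:
    """Userspace model of a batched per-CPU counter (not kernel code)."""

    def __init__(self, nr_cpus, batch):
        self.batch = batch
        self.local = [0] * nr_cpus  # per-CPU deltas not yet folded in
        self.shared = 0             # approximate value that readers see
        self.shared_updates = 0     # number of writes to the shared value

    def add(self, cpu, amount):
        count = self.local[cpu] + amount
        if abs(count) >= self.batch:
            # Local delta reached the batch: fold it into the shared value.
            self.shared += count
            self.shared_updates += 1
            self.local[cpu] = 0
        else:
            self.local[cpu] = count

    def drift(self):
        """Exact value minus the approximate shared value."""
        return sum(self.local)


def run(nr_cpus, batch, nr_ops, seed=0):
    """Apply random +1/-1 updates from random CPUs; return (worst drift, shared updates)."""
    rng = random.Random(seed)
    counter = BatchedCounter(nr_cpus, batch)
    worst = 0
    for _ in range(nr_ops):
        counter.add(rng.randrange(nr_cpus), rng.choice((1, 1, -1)))
        worst = max(worst, abs(counter.drift()))
    return worst, counter.shared_updates


if __name__ == "__main__":
    for batch in (8, 32, 128, 500):
        worst, updates = run(nr_cpus=250, batch=batch, nr_ops=200_000)
        print(f"batch={batch:4d} worst_drift={worst:6d} shared_updates={updates}")
```

In this model the drift always stays below batch * nr_cpus, and a smaller batch buys a tighter bound at the cost of more writes to the shared value, which is the cost the real experiments would have to quantify.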

> and Mateusz Guzik would also like a more general solution and is also
> working to fix the performance issues by caching mm state. Finally,
> Lorenzo Stoakes nacks, as it's too speculative and needs more
> discussion.
> 
> I think the important part is that this improves accuracy; the current
> scheme is difficult to use on many-cored machines. It improves
> performance, but there are tradeoffs; but it tightly bounds the
> inaccuracy so that decisions can actually be reasonably made with the
> resulting numbers.
> 
> This patch assumes that intra-NUMA node atomic updates are very cheap

The above statement/assumption needs experimental data.

> and that
> assigning CPUs to an atomic counter by numa_node_id() % 16 is suitably
> balanced. However, if each atomic were shared by only, say, eight CPUs from the
> same NUMA node, this would further reduce atomic contention at the cost of more
> memory and more complicated assignment of CPU to atomic index. I don't think
> that additional complication is worth it given that this scheme seems to get
> good performance, but it might be. I do need to actually test the impact
> on a many-cores-one-NUMA-node machine, and I look forward to testing out
> Mathieu's hierarchical percpu counter with bounded error.
> 

I am still not buying the 'good performance' point. To me, we might need
to go with a reduced batch size for the existing approach or the multi-level
approach from Mathieu (I still have to look at Mateusz's and Kairui's
proposals).
