linux-kernel - Re: [PATCH v10 0/3] mm: Fix OOM killer inaccuracy on large many-core systems

Message-Id: <20251214153550.10f171f0c98e4ece9a0f1bfe@linux-foundation.org>
Date: Sun, 14 Dec 2025 15:35:50 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: linux-kernel@...r.kernel.org, "Paul E. McKenney" <paulmck@...nel.org>,
 Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
 <mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>, Tejun Heo
 <tj@...nel.org>, Christoph Lameter <cl@...ux.com>, Martin Liu
 <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
 christian.koenig@....com, Shakeel Butt <shakeel.butt@...ux.dev>, SeongJae
 Park <sj@...nel.org>, Michal Hocko <mhocko@...e.com>, Johannes Weiner
 <hannes@...xchg.org>, Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, "Liam R . Howlett"
 <liam.howlett@...cle.com>, Mike Rapoport <rppt@...nel.org>, Suren
 Baghdasaryan <surenb@...gle.com>, Vlastimil Babka <vbabka@...e.cz>,
 Christian Brauner <brauner@...nel.org>
Subject: Re: [PATCH v10 0/3] mm: Fix OOM killer inaccuracy on large
 many-core systems

On Sat, 13 Dec 2025 13:56:05 -0500 Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:

> Introduce hierarchical per-cpu counters and use them for RSS tracking to
> fix the per-mm RSS tracking which has become too inaccurate for OOM
> killer purposes on large many-core systems.
> 
> The following rss tracking issues were noted by Sweet Tea Dorminy [1],
> which lead to picking wrong tasks as OOM kill target:
> 
>   Recently, several internal services had an RSS usage regression as part of a
>   kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
>   read RSS statistics in a backup watchdog process to monitor and decide if
>   they'd overrun their memory budget. Now, however, a representative service
>   with five threads, expected to use about a hundred MB of memory, on a 250-cpu
>   machine had memory usage tens of megabytes different from the expected amount
>   -- this constituted a significant percentage of inaccuracy, causing the
>   watchdog to act.
> 
>   This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
>   into percpu_counter") [1].  Previously, the memory error was bounded by
>   64*nr_threads pages, a very livable megabyte. Now, however, as a result of
>   scheduler decisions moving the threads around the CPUs, the memory error could
>   be as large as a gigabyte.
> 
>   This is a really tremendous inaccuracy for any few-threaded program on a
>   large machine and impedes monitoring significantly. These stat counters are
>   also used to make OOM killing decisions, so this additional inaccuracy could
>   make a big difference in OOM situations -- either resulting in the wrong
>   process being killed, or in less memory being returned from an OOM-kill than
>   expected.
> 
> The approach proposed here is to replace this by the hierarchical
> per-cpu counters, which bounds the inaccuracy based on the system
> topology with O(N*logN).
> 
> Notable change for v10: The new patch 3/3 changes the implementation of
> the oom killer task selection to a 2-pass algorithm, where the first
> pass uses the fast approximation provided by the hierarchical percpu
> counters, and the second pass does a precise sum for all tasks which
> have badness values within the range of the approximation accuracy.
> 
> I've done moderate testing of this series on a 256-core VM with 128GB
> RAM. Figuring out whether this indeed helps solve issues with real-life
> workloads will require broader feedback from the community.
> 
> The one request I did not have time to fulfill yet is to port the
> tests from the librseq feature branch implementation (userspace) to the
> kernel selftests.
> 
> This series is based on v6.18.
> 
> Andrew, are you interested to try this out in mm-new ?

Yes.  We have to start somewhere.

As you kind of mention, it's going to be difficult to determine when
this is ready to go upstream.  I assume that to really know this will
require detailed and lengthy fleet-wide operation and observation.

What sort of drawbacks do you think people might encounter with this
change?
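
For reference, the two-pass task selection described in the quoted cover
letter can be sketched in userspace C roughly as below.  This is only an
illustration, not the code from patch 3/3: struct task_sketch,
select_victim() and max_err are hypothetical names, precise_badness
stands in for an expensive exact sum, and the 2 * max_err candidate
window is an assumption (each approximation may be off by up to max_err
in either direction).

```c
#include <stddef.h>

/* Hypothetical stand-in for a task; not a kernel structure. */
struct task_sketch {
	long approx_badness;	/* cheap estimate, assumed within +/- max_err */
	long precise_badness;	/* stands in for an expensive exact sum */
};

/*
 * Return the index of the task with the largest precise badness among
 * the candidates, or -1 if there are no tasks.
 */
static int select_victim(const struct task_sketch *t, size_t n, long max_err)
{
	long best_approx, best_precise = 0;
	int victim = -1;
	size_t i;

	if (n == 0)
		return -1;

	/* Pass 1: find the largest approximate badness (cheap). */
	best_approx = t[0].approx_badness;
	for (i = 1; i < n; i++)
		if (t[i].approx_badness > best_approx)
			best_approx = t[i].approx_badness;

	/*
	 * Pass 2: only tasks whose approximation is close enough to the
	 * best one can be the true maximum; compare those precisely.
	 */
	for (i = 0; i < n; i++) {
		if (t[i].approx_badness < best_approx - 2 * max_err)
			continue;
		if (victim < 0 || t[i].precise_badness > best_precise) {
			best_precise = t[i].precise_badness;
			victim = (int)i;
		}
	}
	return victim;
}
```

In this sketch the cost of the precise pass scales with the number of
tasks that fall inside the error window, not with the total number of
tasks.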