Message-ID: <3d3f1675-7081-4744-bebd-2eb91c031d42@efficios.com>
Date: Mon, 15 Dec 2025 09:08:55 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org, "Paul E. McKenney" <paulmck@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
<mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>,
Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
christian.koenig@....com, Shakeel Butt <shakeel.butt@...ux.dev>,
SeongJae Park <sj@...nel.org>, Michal Hocko <mhocko@...e.com>,
Johannes Weiner <hannes@...xchg.org>,
Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R . Howlett" <liam.howlett@...cle.com>, Mike Rapoport
<rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>, Christian Brauner <brauner@...nel.org>
Subject: Re: [PATCH v10 0/3] mm: Fix OOM killer inaccuracy on large many-core
systems
On 2025-12-14 18:35, Andrew Morton wrote:
> On Sat, 13 Dec 2025 13:56:05 -0500 Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
[...]
>>
>> Andrew, are you interested to try this out in mm-new ?
>
> Yes. We have to start somewhere.
Cool !
>
> As you kind of mention, it's going to be difficult to determine when
> this is ready to go upstream. I assume that to really know this will
> require detailed and lengthy fleet-wide operation and observation.
For that kind of feature, yes, this is my expectation as well.
> What sort of drawbacks do you think people might encounter with this
> change?
Let's see, here are some possible drawbacks to keep an eye out for:
- Take for instance a machine with a 256 logical CPU topology:
although allocating a small amount of memory is typically handled
with a this_cpu_add_return, a large memory allocation will trickle
the carry up over 3 levels, each of which requires an
atomic_add_return (see the carry propagation sketch below).
The upstream implementation would instead go straight for a global
spinlock, which may or may not be better than 3 atomics.
- 2-pass OOM killer task selection: with a large number of tasks and a
small number of CPUs, the upstream algorithm would be adequately
precise, and faster because it does a single iteration pass. So the
open question here is: do we care about the overhead of OOM killer
task selection? (A toy selection sketch follows below.)
- I understand that some people implement their own OOM killer in
userspace based on RSS values exposed through /proc. Because those
RSS values are the precise counts (split-counter sums), there should
be no difference there compared to the upstream implementation, but
there would be no performance gain either. It may be interesting
to eventually expose the counter approximations (and the accuracy
intervals) to userspace so it could speed up its task selection.
Not really a drawback, more something to keep in mind as a future
improvement.
- I took care not to add additional memory allocations to the mm
allocation/free code because that regresses some benchmarks.
Still, it's good to keep an eye out for bot reports about those
regressions.
- The intermediate tree level counters use extra memory. This is
a tradeoff between compactness and cache locality of the counters.
I currently use cache-aligned integers (thus favoring cache locality
and eliminating false sharing), but I have other prototypes which
use packed bytes for the intermediate levels. For instance, on a
256 core machine, we have 37 intermediate level nodes, for a total
of 2368 bytes (in addition to the 1024 bytes of per-cpu memory for
the per-cpu counters); the arithmetic is spelled out below. If we
choose to instead go for the packed bytes approach, the 37
intermediate level nodes will use 37 bytes of memory, but there will
be false sharing across those counters.
An alternative approach there is to use a strided allocator [1] with a
byte counter set allocator on top [2]. This way we can benefit from
the memory savings of byte counters without the false sharing.
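For the carry propagation point above, here is a rough userspace
sketch (not the kernel code from this series; the node layout, level
count and batch size are made up for illustration) of how a small
increment stays a single local add while a large increment ripples a
carry through each intermediate level with an atomic add:

#include <stdatomic.h>
#include <stdio.h>

#define NR_LEVELS	3	/* intermediate levels of the tree */
#define BATCH		64	/* hypothetical carry threshold per level */

/* node[0] stands in for the per-cpu counter (this_cpu_add_return in
 * the real code), node[1..NR_LEVELS] for the intermediate levels, the
 * last one being the root approximation. */
static _Atomic long node[NR_LEVELS + 1];

static void counter_add(long inc)	/* positive increments only */
{
	for (int level = 0; level <= NR_LEVELS; level++) {
		long old = atomic_fetch_add(&node[level], inc);
		long carry = ((old + inc) / BATCH) - (old / BATCH);

		if (!carry || level == NR_LEVELS)
			return;		/* no carry to forward, or reached the root */
		inc = carry * BATCH;	/* forward the carry to the next level up */
	}
}

int main(void)
{
	counter_add(1);		/* small: stops at the per-cpu level, no carry */
	counter_add(100000);	/* large: carries through all 3 intermediate levels */
	printf("root approximation: %ld\n", atomic_load(&node[NR_LEVELS]));
	return 0;
}

Decrements on free would need the same treatment, which is left out
here to keep the sketch short.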
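For the 2-pass task selection point, here is a toy illustration (the
task structure, accuracy bound and values are invented; this shows
the general approximate-then-precise idea, not the actual oom_kill.c
changes): a first pass scans every task using only the cheap
approximation, and a second pass computes the expensive precise sum
only for tasks whose approximation could still be the maximum.

#include <stdio.h>

struct fake_task {
	const char *name;
	long approx_rss;	/* cheap, possibly stale approximation */
	long precise_rss;	/* expensive split-counter sum */
};

#define ACCURACY	64	/* hypothetical +/- bound on the approximation */

static struct fake_task *pick_victim(struct fake_task *tasks, int nr)
{
	long best_approx = 0, best_precise = -1;
	struct fake_task *victim = NULL;

	/* Pass 1: cheap scan over approximations. */
	for (int i = 0; i < nr; i++)
		if (tasks[i].approx_rss > best_approx)
			best_approx = tasks[i].approx_rss;

	/* Pass 2: precise sums only for tasks that could beat the max. */
	for (int i = 0; i < nr; i++) {
		if (tasks[i].approx_rss + ACCURACY < best_approx - ACCURACY)
			continue;	/* cannot be the largest, skip the sum */
		if (tasks[i].precise_rss > best_precise) {
			best_precise = tasks[i].precise_rss;
			victim = &tasks[i];
		}
	}
	return victim;
}

int main(void)
{
	struct fake_task tasks[] = {
		{ "small", 100, 120 },
		{ "large", 100000, 100010 },
		{ "close", 99990, 100020 },	/* wins only on the precise pass */
	};

	printf("victim: %s\n", pick_victim(tasks, 3)->name);
	return 0;
}

The cost is iterating the task list twice, which is what can make the
upstream single-pass iteration cheaper with many tasks and few CPUs.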
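And a quick back-of-the-envelope check of the memory footprint
numbers (assuming a fan-out of 8 per tree level, which is what makes
37 nodes over 3 levels line up for 256 CPUs; the actual tree
parameters may differ):

#include <stdio.h>

int main(void)
{
	int cpus = 256, fanout = 8;	/* fan-out of 8 is an assumption */
	int width = cpus, nodes = 0;

	do {
		width = (width + fanout - 1) / fanout;	/* nodes one level up */
		nodes += width;				/* 32, then 4, then 1 */
	} while (width > 1);

	printf("intermediate nodes:         %d\n", nodes);		/* 37 */
	printf("cache-aligned, 64 B each:   %d bytes\n", nodes * 64);	/* 2368 */
	printf("packed, 1 B each:           %d bytes\n", nodes);	/* 37 */
	printf("per-cpu counters, 4 B each: %d bytes\n", cpus * 4);	/* 1024 */
	return 0;
}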
Thanks,
Mathieu
[1] https://github.com/compudj/librseq/blob/percpu-counter-byte/src/rseq-mempool.c
[2] https://github.com/compudj/librseq/blob/percpu-counter-byte/src/percpu-counter-tree.c#L190
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com