Message-ID: <24ed69ca-7914-455e-ae8c-5f24f52aa377@efficios.com>
Date: Mon, 15 Dec 2025 09:21:34 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org, "Paul E. McKenney" <paulmck@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
<mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>,
Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
christian.koenig@....com, Shakeel Butt <shakeel.butt@...ux.dev>,
SeongJae Park <sj@...nel.org>, Michal Hocko <mhocko@...e.com>,
Johannes Weiner <hannes@...xchg.org>,
Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R . Howlett" <liam.howlett@...cle.com>, Mike Rapoport
<rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>, Christian Brauner <brauner@...nel.org>
Subject: Re: [PATCH v10 0/3] mm: Fix OOM killer inaccuracy on large many-core
systems
On 2025-12-15 09:08, Mathieu Desnoyers wrote:
> On 2025-12-14 18:35, Andrew Morton wrote:
>> On Sat, 13 Dec 2025 13:56:05 -0500 Mathieu Desnoyers
>> <mathieu.desnoyers@...icios.com> wrote:
> [...]
>>>
>>> Andrew, are you interested to try this out in mm-new ?
>>
>> Yes. We have to start somewhere.
>
> Cool !
>
>>
>> As you kind of mention, it's going to be difficult to determine when
>> this is ready to go upstream. I assume that to really know this will
>> require detailed and lengthy fleet-wide operation and observation.
>
> For that kind of feature, yes, this is my expectation as well.
>
>> What sort of drawbacks do you think people might encounter with this
>> change?
>
> Let's see, here are some possible drawbacks to keep an eye out for:
>
> - Taking for instance a machine with a 256 logical CPU topology:
>   although an allocation of a small amount of memory is typically
>   handled with a single this_cpu_add_return, a large memory allocation
>   will trickle the carry up over 3 levels, each of which requires an
>   atomic_add_return.
>
> The upstream implementation would instead go straight for a global
> spinlock, which may or may not be better than 3 atomics.
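To illustrate that carry path, here is a minimal userspace sketch. The
names, the uniform batch size and the flattened one-node-per-level tree
are all assumptions for illustration, not the actual patch code (which
uses this_cpu_add_return on the fast path and derives the tree shape
from per_nr_cpu_order_config):

```c
#include <stdatomic.h>

#define BATCH  32	/* illustrative batch size */
#define LEVELS 3	/* e.g. 256 CPUs -> 3 intermediate levels */

/* One CPU's path from its per-CPU counter up to the root. */
static _Atomic long node[LEVELS + 1];	/* [0] = per-cpu, [LEVELS] = root */

static void counter_add(long nr)
{
	for (int lvl = 0; lvl <= LEVELS; lvl++) {
		long v = atomic_fetch_add(&node[lvl], nr) + nr;

		/* Small adds stop here: no batch boundary crossed. */
		if (lvl == LEVELS || v / BATCH == (v - nr) / BATCH)
			return;
		/* Keep the remainder at this level, carry the rest up:
		 * one more atomic RMW per level climbed. */
		long carry = (v / BATCH) * BATCH;
		atomic_fetch_sub(&node[lvl], carry);
		nr = carry;
	}
}
```

The precise count is the sum over all nodes; a large add pays one
atomic per level climbed, which is the "3 atomics vs. one global
spinlock" tradeoff mentioned above.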
>
> - 2-pass OOM killer task selection: with a large number of tasks and a
>   small number of CPUs, the upstream algorithm would be adequately
>   precise, and faster because it does a single iteration pass. So the
>   open question here is: do we care about the overhead of OOM killer
>   task selection ?
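A sketch of what the 2-pass scheme amounts to, with a hypothetical task
structure (not the patch code): pass 1 ranks all tasks by the cheap
approximate badness, pass 2 computes the expensive precise sums only
for the retained candidates:

```c
#include <stddef.h>

#define PRECISE_LIMIT 16	/* precise badness sums limit */

/* Hypothetical task record: a cheap approximation read from the tree
 * root vs. the expensive precise split-counter sum. */
struct task {
	long approx;
	long precise;
};

/* Returns the index of the victim; assumes n > 0. */
static size_t pick_victim(const struct task *t, size_t n)
{
	size_t top[PRECISE_LIMIT], nr_top = 0;

	/* Pass 1: keep the PRECISE_LIMIT largest approximations. */
	for (size_t i = 0; i < n; i++) {
		size_t j;

		if (nr_top < PRECISE_LIMIT)
			j = nr_top++;
		else if (t[i].approx <= t[top[PRECISE_LIMIT - 1]].approx)
			continue;	/* cannot beat the current top set */
		else
			j = PRECISE_LIMIT - 1;
		/* Insert into the small array, descending by approx. */
		while (j > 0 && t[i].approx > t[top[j - 1]].approx) {
			top[j] = top[j - 1];
			j--;
		}
		top[j] = i;
	}
	/* Pass 2: at most PRECISE_LIMIT expensive precise sums. */
	size_t best = top[0];
	for (size_t k = 1; k < nr_top; k++)
		if (t[top[k]].precise > t[best].precise)
			best = top[k];
	return best;
}
```

With few CPUs the approximation intervals are tight, so a single precise
pass (as upstream does) can be both accurate and cheaper, which is the
tradeoff in question.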
>
> - My understanding is that some people implement their own OOM killer
>   in userspace based on RSS values exposed through /proc. Because those
>   RSS values are the precise counts (split-counter sums), there should
>   be no difference there compared to the upstream implementation, but
>   there would be no performance gain either. It may be interesting to
>   eventually expose the counter approximations (and the accuracy
>   intervals) to userspace so it could speed up its task selection. Not
>   really a drawback, more something to keep in mind as a future
>   improvement.
>
> - I took care not to add additional memory allocation to the mm
>   allocation/free code, because that regresses some benchmarks. Still,
>   it's good to keep an eye out for bot reports about such regressions.
>
> - The intermediate tree level counters use extra memory. This is a
>   tradeoff between compactness and cache locality of the counters. I
>   currently used cache-aligned integers (thus favoring cache locality
>   and eliminating false-sharing), but I have other prototypes which use
>   packed bytes for the intermediate levels. For instance, on a 256 core
>   machine, we have 37 intermediate level nodes, for a total of 2368
>   bytes (in addition to the 1024 bytes of per-cpu memory for the
>   per-cpu counters). If we instead choose the packed bytes approach,
>   the 37 intermediate level nodes will use 37 bytes of memory, but
>   there will be false-sharing across those counters.
>
> An alternative approach there is to use a strided allocator [1] with a
> byte counter set allocator on top [2]. This way we can benefit from the
> memory savings of byte counters without the false-sharing.
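For concreteness, the arithmetic behind the 37 nodes / 2368 bytes
figures, assuming a uniform 8-ary fan-in over 256 CPUs and 64-byte
cache lines (the actual per-level arity comes from
per_nr_cpu_order_config, so this is an approximation):

```c
/* Count intermediate nodes of an n-ary reduction tree over nr_cpus:
 * each level aggregates `arity` counters from the level below, down
 * to a single root. */
static int intermediate_nodes(int nr_cpus, int arity)
{
	int nodes = 0;

	for (int width = nr_cpus; width > 1; ) {
		width = (width + arity - 1) / arity;	/* ceil division */
		nodes += width;
	}
	return nodes;
}
```

For 256 CPUs and arity 8 this gives 32 + 4 + 1 = 37 nodes: 37 * 64
bytes = 2368 bytes cache-aligned, vs. 37 bytes packed.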
One more point:
- The choice of constants is an educated guess at best and would require
testing/feedback on real-world workloads:
- The batch size (32),
- The n-arity of the counter tree for each power-of-two number of CPUs
(per_nr_cpu_order_config).
- The choice of making the number of intermediate level counter bits
match the n-arity of the counters aggregated into that level is also
arbitrary.
- The precise badness sums limit (16) for OOM killer task selection.
This is perhaps something that could become a sysctl tunable.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com