Message-ID: <3d3f1675-7081-4744-bebd-2eb91c031d42@efficios.com>
Date: Mon, 15 Dec 2025 09:08:55 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org, "Paul E. McKenney" <paulmck@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
<mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>,
Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
christian.koenig@....com, Shakeel Butt <shakeel.butt@...ux.dev>,
SeongJae Park <sj@...nel.org>, Michal Hocko <mhocko@...e.com>,
Johannes Weiner <hannes@...xchg.org>,
Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R . Howlett" <liam.howlett@...cle.com>, Mike Rapoport
<rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>, Christian Brauner <brauner@...nel.org>
Subject: Re: [PATCH v10 0/3] mm: Fix OOM killer inaccuracy on large many-core
systems
On 2025-12-14 18:35, Andrew Morton wrote:
> On Sat, 13 Dec 2025 13:56:05 -0500 Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
[...]
>>
>> Andrew, are you interested to try this out in mm-new ?
>
> Yes. We have to start somewhere.
Cool !
>
> As you kind of mention, it's going to be difficult to determine when
> this is ready to go upstream. I assume that to really know this will
> require detailed and lengthy fleet-wide operation and observation.
For that kind of feature, yes, this is my expectation as well.
> What sort of drawbacks do you think people might encounter with this
> change?
Let's see, here are some possible drawbacks to keep an eye out for:
- Take for instance a machine with a 256 logical CPU topology:
although allocating a small amount of memory is typically handled
with a this_cpu_add_return, a large memory allocation will trickle
the carry up over 3 levels, each of which requires an
atomic_add_return (see the carry propagation sketch below).
The upstream implementation would instead go straight for a global
spinlock, which may or may not be better than 3 atomics.
- 2-pass OOM killer task selection: with a large number of tasks and a
small number of CPUs, the upstream algorithm would be adequately
precise, and faster because it does a single iteration pass. So the
open question here is: do we care about the overhead of OOM killer
task selection? (A toy selection sketch follows below.)
- I understand that some people implement their own OOM killer in
userspace based on RSS values exposed through /proc. Because those
RSS values are the precise counts (split-counter sums), there should
be no difference there compared to the upstream implementation, but
there would be no performance gain either. It may be interesting
to eventually expose the counter approximations (and the accuracy
intervals) to userspace so it could speed up its task selection.
Not really a drawback, more something to keep in mind as a future
improvement.
- I took care not to add additional memory allocations to the mm
allocation/free code because that regresses some benchmarks.
Still, it's good to keep an eye out for bot reports about those
regressions.
- The intermediate tree level counters use extra memory. This is
a tradeoff between compactness and cache locality of the counters.
I currently use cache-aligned integers (thus favoring cache locality
and eliminating false sharing), but I have other prototypes which
use packed bytes for the intermediate levels. For instance, on a
256 core machine, we have 37 intermediate level nodes, for a total
of 2368 bytes (in addition to the 1024 bytes of per-cpu memory for
the per-cpu counters); the arithmetic is spelled out below. If we
choose to instead go for the packed bytes approach, the 37
intermediate level nodes will use 37 bytes of memory, but there will
be false sharing across those counters.
An alternative approach there is to use a strided allocator [1] with a
byte counter set allocator on top [2]. This way we can benefit from
the memory savings of byte counters without the false sharing.
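For the carry propagation point above, here is a rough userspace
sketch (not the kernel code from this series; the node layout, level
count and batch size are made up for illustration) of how a small
increment stays a single local add while a large increment ripples a
carry through each intermediate level with an atomic add:

#include <stdatomic.h>
#include <stdio.h>

#define NR_LEVELS	3	/* intermediate levels of the tree */
#define BATCH		64	/* hypothetical carry threshold per level */

/* node[0] stands in for the per-cpu counter (this_cpu_add_return in
 * the real code), node[1..NR_LEVELS] for the intermediate levels, the
 * last one being the root approximation. */
static _Atomic long node[NR_LEVELS + 1];

static void counter_add(long inc)	/* positive increments only */
{
	for (int level = 0; level <= NR_LEVELS; level++) {
		long old = atomic_fetch_add(&node[level], inc);
		long carry = ((old + inc) / BATCH) - (old / BATCH);

		if (!carry || level == NR_LEVELS)
			return;		/* no carry to forward, or reached the root */
		inc = carry * BATCH;	/* forward the carry to the next level up */
	}
}

int main(void)
{
	counter_add(1);		/* small: stops at the per-cpu level, no carry */
	counter_add(100000);	/* large: carries through all 3 intermediate levels */
	printf("root approximation: %ld\n", atomic_load(&node[NR_LEVELS]));
	return 0;
}

Decrements on free would need the same treatment, which is left out
here to keep the sketch short.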
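For the 2-pass task selection point, here is a toy illustration (the
task structure, accuracy bound and values are invented; this shows
the general approximate-then-precise idea, not the actual oom_kill.c
changes): a first pass scans every task using only the cheap
approximation, and a second pass computes the expensive precise sum
only for tasks whose approximation could still be the maximum.

#include <stdio.h>

struct fake_task {
	const char *name;
	long approx_rss;	/* cheap, possibly stale approximation */
	long precise_rss;	/* expensive split-counter sum */
};

#define ACCURACY	64	/* hypothetical +/- bound on the approximation */

static struct fake_task *pick_victim(struct fake_task *tasks, int nr)
{
	long best_approx = 0, best_precise = -1;
	struct fake_task *victim = NULL;

	/* Pass 1: cheap scan over approximations. */
	for (int i = 0; i < nr; i++)
		if (tasks[i].approx_rss > best_approx)
			best_approx = tasks[i].approx_rss;

	/* Pass 2: precise sums only for tasks that could beat the max. */
	for (int i = 0; i < nr; i++) {
		if (tasks[i].approx_rss + ACCURACY < best_approx - ACCURACY)
			continue;	/* cannot be the largest, skip the sum */
		if (tasks[i].precise_rss > best_precise) {
			best_precise = tasks[i].precise_rss;
			victim = &tasks[i];
		}
	}
	return victim;
}

int main(void)
{
	struct fake_task tasks[] = {
		{ "small", 100, 120 },
		{ "large", 100000, 100010 },
		{ "close", 99990, 100020 },	/* wins only on the precise pass */
	};

	printf("victim: %s\n", pick_victim(tasks, 3)->name);
	return 0;
}

The cost is iterating the task list twice, which is what can make the
upstream single-pass iteration cheaper with many tasks and few CPUs.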
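And a quick back-of-the-envelope check of the memory footprint
numbers (assuming a fan-out of 8 per tree level, which is what makes
37 nodes over 3 levels line up for 256 CPUs; the actual tree
parameters may differ):

#include <stdio.h>

int main(void)
{
	int cpus = 256, fanout = 8;	/* fan-out of 8 is an assumption */
	int width = cpus, nodes = 0;

	do {
		width = (width + fanout - 1) / fanout;	/* nodes one level up */
		nodes += width;				/* 32, then 4, then 1 */
	} while (width > 1);

	printf("intermediate nodes:         %d\n", nodes);		/* 37 */
	printf("cache-aligned, 64 B each:   %d bytes\n", nodes * 64);	/* 2368 */
	printf("packed, 1 B each:           %d bytes\n", nodes);	/* 37 */
	printf("per-cpu counters, 4 B each: %d bytes\n", cpus * 4);	/* 1024 */
	return 0;
}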
Thanks,
Mathieu
[1] https://github.com/compudj/librseq/blob/percpu-counter-byte/src/rseq-mempool.c
[2] https://github.com/compudj/librseq/blob/percpu-counter-byte/src/percpu-counter-tree.c#L190
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com