Message-ID: <24ed69ca-7914-455e-ae8c-5f24f52aa377@efficios.com>
Date: Mon, 15 Dec 2025 09:21:34 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org, "Paul E. McKenney" <paulmck@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
<mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>,
Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
christian.koenig@....com, Shakeel Butt <shakeel.butt@...ux.dev>,
SeongJae Park <sj@...nel.org>, Michal Hocko <mhocko@...e.com>,
Johannes Weiner <hannes@...xchg.org>,
Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R . Howlett" <liam.howlett@...cle.com>, Mike Rapoport
<rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>, Christian Brauner <brauner@...nel.org>
Subject: Re: [PATCH v10 0/3] mm: Fix OOM killer inaccuracy on large many-core
systems
On 2025-12-15 09:08, Mathieu Desnoyers wrote:
> On 2025-12-14 18:35, Andrew Morton wrote:
>> On Sat, 13 Dec 2025 13:56:05 -0500 Mathieu Desnoyers
>> <mathieu.desnoyers@...icios.com> wrote:
> [...]
>>>
>>> Andrew, are you interested to try this out in mm-new ?
>>
>> Yes. We have to start somewhere.
>
> Cool !
>
>>
>> As you kind of mention, it's going to be difficult to determine when
>> this is ready to go upstream. I assume that to really know this will
>> require detailed and lengthy fleet-wide operation and observation.
>
> For that kind of feature, yes, this is my expectation as well.
>
>> What sort of drawbacks do you think people might encounter with this
>> change?
>
> Let's see, here are some possible drawbacks to keep an eye out for:
>
> - Taking for instance a machine with a 256 logical CPU topology:
>   although an allocation of a small amount of memory is typically
>   handled with a single this_cpu_add_return, a large memory allocation
>   will trickle the carry up over 3 levels, each of which requires an
>   atomic_add_return.
>
> The upstream implementation would instead go straight for a global
> spinlock, which may or may not be better than 3 atomics.
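To illustrate that carry path, here is a minimal userspace sketch. The
names, the uniform batch size and the flattened one-node-per-level tree
are all assumptions for illustration, not the actual patch code (which
uses this_cpu_add_return on the fast path and derives the tree shape
from per_nr_cpu_order_config):

```c
#include <stdatomic.h>

#define BATCH  32	/* illustrative batch size */
#define LEVELS 3	/* e.g. 256 CPUs -> 3 intermediate levels */

/* One CPU's path from its per-CPU counter up to the root. */
static _Atomic long node[LEVELS + 1];	/* [0] = per-cpu, [LEVELS] = root */

static void counter_add(long nr)
{
	for (int lvl = 0; lvl <= LEVELS; lvl++) {
		long v = atomic_fetch_add(&node[lvl], nr) + nr;

		/* Small adds stop here: no batch boundary crossed. */
		if (lvl == LEVELS || v / BATCH == (v - nr) / BATCH)
			return;
		/* Keep the remainder at this level, carry the rest up:
		 * one more atomic RMW per level climbed. */
		long carry = (v / BATCH) * BATCH;
		atomic_fetch_sub(&node[lvl], carry);
		nr = carry;
	}
}
```

The precise count is the sum over all nodes; a large add pays one
atomic per level climbed, which is the "3 atomics vs. one global
spinlock" tradeoff mentioned above.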
>
> - 2-pass OOM killer task selection: with a large number of tasks and a
>   small number of CPUs, the upstream algorithm would be adequately
>   precise, and faster because it does a single iteration pass. So the
>   open question here is: do we care about the overhead of OOM killer
>   task selection ?
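A sketch of what the 2-pass scheme amounts to, with a hypothetical task
structure (not the patch code): pass 1 ranks all tasks by the cheap
approximate badness, pass 2 computes the expensive precise sums only
for the retained candidates:

```c
#include <stddef.h>

#define PRECISE_LIMIT 16	/* precise badness sums limit */

/* Hypothetical task record: a cheap approximation read from the tree
 * root vs. the expensive precise split-counter sum. */
struct task {
	long approx;
	long precise;
};

/* Returns the index of the victim; assumes n > 0. */
static size_t pick_victim(const struct task *t, size_t n)
{
	size_t top[PRECISE_LIMIT], nr_top = 0;

	/* Pass 1: keep the PRECISE_LIMIT largest approximations. */
	for (size_t i = 0; i < n; i++) {
		size_t j;

		if (nr_top < PRECISE_LIMIT)
			j = nr_top++;
		else if (t[i].approx <= t[top[PRECISE_LIMIT - 1]].approx)
			continue;	/* cannot beat the current top set */
		else
			j = PRECISE_LIMIT - 1;
		/* Insert into the small array, descending by approx. */
		while (j > 0 && t[i].approx > t[top[j - 1]].approx) {
			top[j] = top[j - 1];
			j--;
		}
		top[j] = i;
	}
	/* Pass 2: at most PRECISE_LIMIT expensive precise sums. */
	size_t best = top[0];
	for (size_t k = 1; k < nr_top; k++)
		if (t[top[k]].precise > t[best].precise)
			best = top[k];
	return best;
}
```

With few CPUs the approximation intervals are tight, so a single precise
pass (as upstream does) can be both accurate and cheaper, which is the
tradeoff in question.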
>
> - My understanding is that some people implement their own OOM killer
>   in userspace based on RSS values exposed through /proc. Because those
>   RSS values are the precise counts (split-counter sums), there should
>   be no difference there compared to the upstream implementation, but
>   there would be no performance gain either. It may be interesting to
>   eventually expose the counter approximations (and the accuracy
>   intervals) to userspace so it could speed up its task selection. Not
>   really a drawback, more something to keep in mind as a future
>   improvement.
>
> - I took care not to add additional memory allocation to the mm
>   allocation/free code, because that regresses some benchmarks. Still,
>   it's good to keep an eye out for bot reports about such regressions.
>
> - The intermediate tree level counters use extra memory. This is a
>   tradeoff between compactness and cache locality of the counters. I
>   currently used cache-aligned integers (thus favoring cache locality
>   and eliminating false-sharing), but I have other prototypes which use
>   packed bytes for the intermediate levels. For instance, on a 256 core
>   machine, we have 37 intermediate level nodes, for a total of 2368
>   bytes (in addition to the 1024 bytes of per-cpu memory for the
>   per-cpu counters). If we instead choose the packed bytes approach,
>   the 37 intermediate level nodes will use 37 bytes of memory, but
>   there will be false-sharing across those counters.
>
> An alternative approach there is to use a strided allocator [1] with a
> byte counter set allocator on top [2]. This way we can benefit from the
> memory savings of byte counters without the false-sharing.
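For concreteness, the arithmetic behind the 37 nodes / 2368 bytes
figures, assuming a uniform 8-ary fan-in over 256 CPUs and 64-byte
cache lines (the actual per-level arity comes from
per_nr_cpu_order_config, so this is an approximation):

```c
/* Count intermediate nodes of an n-ary reduction tree over nr_cpus:
 * each level aggregates `arity` counters from the level below, down
 * to a single root. */
static int intermediate_nodes(int nr_cpus, int arity)
{
	int nodes = 0;

	for (int width = nr_cpus; width > 1; ) {
		width = (width + arity - 1) / arity;	/* ceil division */
		nodes += width;
	}
	return nodes;
}
```

For 256 CPUs and arity 8 this gives 32 + 4 + 1 = 37 nodes: 37 * 64
bytes = 2368 bytes cache-aligned, vs. 37 bytes packed.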
One more point:
- The choice of constants is an educated guess at best and would require
testing/feedback on real-world workloads:
- The batch size (32),
- The n-arity of the counter tree for each power-of-two number of CPUs
(per_nr_cpu_order_config).
- The choice of making the number of intermediate level counter bits
match the n-arity of the counters aggregated into that level is also
arbitrary.
- The precise badness sums limit (16) for OOM killer task selection.
This is perhaps something that could become a sysctl tunable.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com