Message-ID: <f2c04264-17ca-418f-bc43-e8aa6fa6cd0d@efficios.com>
Date: Mon, 12 Jan 2026 14:37:49 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Michal Hocko <mhocko@...e.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 "Paul E. McKenney" <paulmck@...nel.org>, Steven Rostedt
 <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>,
 Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
 Christoph Lameter <cl@...ux.com>, Martin Liu <liumartin@...gle.com>,
 David Rientjes <rientjes@...gle.com>, christian.koenig@....com,
 Shakeel Butt <shakeel.butt@...ux.dev>, SeongJae Park <sj@...nel.org>,
 Johannes Weiner <hannes@...xchg.org>,
 Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R . Howlett" <liam.howlett@...cle.com>, Mike Rapoport
 <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
 Vlastimil Babka <vbabka@...e.cz>, Christian Brauner <brauner@...nel.org>,
 Wei Yang <richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>,
 Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>,
 linux-mm@...ck.org, linux-trace-kernel@...r.kernel.org,
 Yu Zhao <yuzhao@...gle.com>, Roman Gushchin <roman.gushchin@...ux.dev>,
 Mateusz Guzik <mjguzik@...il.com>, Matthew Wilcox <willy@...radead.org>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>,
 Aboorva Devarajan <aboorvad@...ux.ibm.com>
Subject: Re: [PATCH v13 2/3] mm: Fix OOM killer inaccuracy on large many-core
 systems

On 2026-01-12 03:42, Michal Hocko wrote:
> Hi,
> sorry to jump in this late but the timing of previous versions didn't
> really work well for me.
> 
> On Sun 11-01-26 14:49:57, Mathieu Desnoyers wrote:
> [...]
>> Here is a (possibly incomplete) list of the prior approaches that were
>> used or proposed, along with their downsides:
>>
>> 1) Per-thread rss tracking: large error on many-thread processes.
>>
>> 2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
>>     increased system time in make test workloads [1]. Moreover, the
>>     inaccuracy grows as O(n^2) with the number of CPUs.
>>
>> 3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
>>     error is high with systems that have lots of NUMA nodes (32 times
>>     the number of NUMA nodes).
>>
>> The approach proposed here is to replace this with hierarchical
>> per-cpu counters, which bound the inaccuracy, based on the system
>> topology, to O(N*logN).
> 
> The concept of hierarchical pcp counter is interesting and I am
> definitely not opposed if there are more users that would benefit.
> 
> From the OOM POV, IIUC the primary problem is that get_mm_counter
> (percpu_counter_read_positive) is too imprecise on systems where the
> task moves around a large number of cpus. In the list of alternative
> solutions I do not see percpu_counter_sum_positive mentioned.
> oom_badness() is a really slow path and taking the slow path to
> calculate a much more precise value seems acceptable. Have you
> considered that option?

I must admit I assumed that, since there was already a mechanism in
place to avoid summing per-cpu counters when the oom killer selects
tasks, it was because this

   O(nr_possible_cpus * nr_processes)

operation is too slow for the oom killer's requirements.
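
For reference, the precise read is essentially the following per-cpu
walk (simplified from lib/percpu_counter.c; the in-tree version has a
few more details, but the per-cpu cost is the point), and
oom_badness() would have to do it once per candidate process:

/* Simplified sketch of the precise sum in lib/percpu_counter.c. */
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	unsigned long flags;
	s64 ret;
	int cpu;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	for_each_possible_cpu(cpu)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return ret;
}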

AFAIU, the oom killer is executed when the memory allocator fails to
allocate memory, which can happen in code paths that need to progress
eventually. So even though it is a slow path compared to the allocator
fast path, there must be at least _some_ expectation that it completes
within a decent amount of time. What would that ballpark be?

To give an order of magnitude, I modified the upstream oom killer to
use percpu_counter_sum_positive and compared it with the hierarchical
approach:

AMD EPYC 9654 96-Core (2 sockets),
running within a KVM guest configured with 256 logical CPUs.

                    nr_processes=40    nr_processes=10000
Counter sum:            0.4 ms             81.0 ms
HPCC with 2-pass:       0.3 ms              9.3 ms

So as the number of processes grows on large SMP systems, the latency
of oom killer task selection increases much faster with the full
counter sums than with the hierarchical approach.
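
In case it helps make the comparison concrete, here is a minimal
userspace model of the hierarchical idea (the names below are made up
for this example; they are not the API proposed in this series). Each
CPU updates its own leaf, and once a leaf's local delta reaches a
batch size the whole batch is carried one level up toward the root,
so an approximate read is a single load of the root, with an error
bounded by the batch size times the number of non-root nodes:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS	8
#define FANOUT	2	/* children per intermediate node */
#define BATCH	32	/* carry threshold at each level */

struct node {
	atomic_long count;
	struct node *parent;	/* NULL for the root */
};

/* Add @v at @leaf, carrying full batches toward the root. */
static void counter_add(struct node *leaf, long v)
{
	struct node *n = leaf;

	for (;;) {
		long now = atomic_fetch_add(&n->count, v) + v;

		if (!n->parent || (now > -BATCH && now < BATCH))
			return;
		/* Move the batch up one level; the global sum is preserved. */
		atomic_fetch_sub(&n->count, now);
		v = now;
		n = n->parent;
	}
}

/* Approximate read: a single load, no per-cpu walk. */
static long counter_read_approx(struct node *root)
{
	return atomic_load(&root->count);
}

/* Build a FANOUT-ary tree over @n leaves; returns the root. */
static struct node *build_level(struct node **kids, int n)
{
	int i, m = (n + FANOUT - 1) / FANOUT;
	struct node **up;

	if (n == 1)
		return kids[0];
	up = calloc(m, sizeof(*up));
	for (i = 0; i < m; i++)
		up[i] = calloc(1, sizeof(struct node));
	for (i = 0; i < n; i++)
		kids[i]->parent = up[i / FANOUT];
	return build_level(up, m);
}

int main(void)
{
	struct node *leaf[NR_CPUS], *root;
	int i;

	for (i = 0; i < NR_CPUS; i++)
		leaf[i] = calloc(1, sizeof(struct node));
	root = build_level(leaf, NR_CPUS);

	/* A task faulting pages while migrating across every CPU. */
	for (i = 0; i < 10000; i++)
		counter_add(leaf[i % NR_CPUS], 1);

	/* Error stays below BATCH * nr_non_root_nodes (32 * 14 = 448). */
	printf("approx = %ld, exact = 10000\n", counter_read_approx(root));
	return 0;
}

My understanding is that this bounded-error root read is what allows
the 2-pass variant above to avoid full per-cpu sums for most
candidates.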

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
