linux-kernel - Re: [PATCH v16 3/3] mm: Reduce latency of OOM killer task selection with 2-pass algorithm


Message-ID: <aWfNMKCoEp2LuA2v@tiehlicka>
Date: Wed, 14 Jan 2026 18:06:56 +0100
From: Michal Hocko <mhocko@...e.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Masami Hiramatsu <mhiramat@...nel.org>,
	Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
	Christoph Lameter <cl@...ux.com>, Martin Liu <liumartin@...gle.com>,
	David Rientjes <rientjes@...gle.com>, christian.koenig@....com,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	SeongJae Park <sj@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
	Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	"Liam R . Howlett" <liam.howlett@...cle.com>,
	Mike Rapoport <rppt@...nel.org>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Christian Brauner <brauner@...nel.org>,
	Wei Yang <richard.weiyang@...il.com>,
	David Hildenbrand <david@...hat.com>,
	Miaohe Lin <linmiaohe@...wei.com>,
	Al Viro <viro@...iv.linux.org.uk>, linux-mm@...ck.org,
	linux-trace-kernel@...r.kernel.org, Yu Zhao <yuzhao@...gle.com>,
	Roman Gushchin <roman.gushchin@...ux.dev>,
	Mateusz Guzik <mjguzik@...il.com>,
	Matthew Wilcox <willy@...radead.org>,
	Baolin Wang <baolin.wang@...ux.alibaba.com>,
	Aboorva Devarajan <aboorvad@...ux.ibm.com>
Subject: Re: [PATCH v16 3/3] mm: Reduce latency of OOM killer task selection
 with 2-pass algorithm

On Wed 14-01-26 09:59:15, Mathieu Desnoyers wrote:
> Use the hierarchical tree counter approximation (hpcc) to implement the
> OOM killer task selection with a 2-pass algorithm. The first pass
> selects the process that has the highest badness points approximation,
> and the second pass compares each process using the current max badness
> points approximation.
> 
> The second pass uses an approximate comparison to eliminate all processes
> which are below the current max badness points approximation accuracy
> range.
> 
> Summing the per-CPU counters to calculate the precise badness of tasks
> is only required for tasks with an approximate badness within the
> accuracy range of the current max points value.
> 
> Limit to 16 the maximum number of badness sums allowed for an OOM killer
> task selection before falling back to the approximated comparison. This
> ensures bounded execution time for scenarios where many tasks have
> badness within the accuracy of the maximum badness approximation.
> 
> Testing the execution time of select_bad_process() with a single
> tail -f /dev/zero:
> 
>     AMD EPYC 9654 96-Core (2 sockets)
>     Within a KVM, configured with 256 logical cpus.
> 
>                                       | precise sum |   hpcc   |
>     ----------------------------------|-------------|----------|
>     nr_processes=40                   |    0.5 ms   |   0.3 ms |
>     nr_processes=10000                |   80.0 ms   |   7.9 ms |
> 
> Tested with the following script:

I am confused by these numbers. Are you saying that two passes over all
tasks, evaluating every one of them, are 10 times faster than a single
pass with an exact sum of the pcp counters?
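
For reference, the selection described in the quoted changelog can be
modelled roughly as below. This is only a userspace sketch under stated
assumptions, not the patch's code: the Task type, the per_cpu list and
the symmetric "accuracy" half-width are hypothetical stand-ins for the
hpcc approximation and its error range, and badness points are assumed
to be non-negative. Only the cap of 16 precise sums comes from the
changelog.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_PRECISE_SUMS = 16  # cap on precise sums quoted in the changelog


@dataclass
class Task:
    name: str
    approx: int          # cheap approximate badness (hypothetical field)
    per_cpu: List[int]   # per-CPU contributions; summing them is the costly part


def precise_badness(task: Task) -> int:
    """Exact badness: walks every per-CPU counter of the task."""
    return sum(task.per_cpu)


def select_victim(tasks: List[Task], accuracy: int,
                  budget: int = MAX_PRECISE_SUMS) -> Tuple[Optional[Task], int]:
    """Return (chosen task, number of precise sums performed)."""
    if not tasks:
        return None, 0

    # Pass 1: highest approximate badness, no per-CPU summing at all.
    max_approx = max(t.approx for t in tasks)

    chosen, chosen_points, sums = None, -1, 0
    # Pass 2: only tasks whose approximation overlaps the accuracy range
    # of the current maximum can still win; everything else is skipped.
    for t in tasks:
        if t.approx + accuracy < max_approx - accuracy:
            continue
        if sums < budget:
            points = precise_badness(t)   # bounded number of exact sums
            sums += 1
        else:
            points = t.approx             # budget spent: approximate compare
        if points > chosen_points:
            chosen, chosen_points = t, points
    return chosen, sums


tasks = [
    Task("a", approx=100, per_cpu=[50, 47]),   # exact 97
    Task("b", approx=98, per_cpu=[60, 41]),    # exact 101
    Task("c", approx=10, per_cpu=[5, 5]),      # far below the range, skipped
]
victim, sums = select_victim(tasks, accuracy=4)
print(victim.name, sums)
```

In this model the per-CPU walk happens only for the few tasks near the
maximum, while a single exact pass would do it for every task.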

> 
>   #!/bin/sh
> 
>   for a in $(seq 1 10); do (tail /dev/zero &); done
>   sleep 5
>   for a in $(seq 1 10); do (tail /dev/zero &); done
>   sleep 2
>   for a in $(seq 1 10); do (tail /dev/zero &); done
>   echo "Waiting for tasks to finish"
>   wait
> 
> Results: OOM kill order on a 128GB memory system
> ================================================

I find this section confusing as well. Is this a before/after comparison?
If so, it would be great to explicitly call out the behavior before and
after.

My overall impression is that the implementation is really involved and
at this moment I do not really see a big benefit of all the complexity.

It would help to explicitly mention what the overall imprecision of the
oom victim selection is with the new data structure (maybe this is
good enough[*]). How does going with exact precision on the new data
structure compare to the original pcp counters?


[*] Please keep in mind that oom victim selection is by no means an
exact science. We try to pick a task that is likely to free up some
memory to unlock the system from memory depletion. We want that to be a
big memory consumer to reduce the number of tasks to kill, and we want to
roughly apply oom_score_adj.
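
As a rough illustration of that heuristic, here is a simplified
userspace model of how the badness score combines memory footprint with
oom_score_adj. It is a sketch of the general shape of oom_badness() as
commonly described, not a copy of the kernel code; the function name
and parameters are hypothetical, and returning None for an unkillable
task is a simplification.

```python
from typing import Optional

OOM_SCORE_ADJ_MIN = -1000  # tasks with this adjustment are never selected


def badness(rss_pages: int, swap_pages: int, pgtable_pages: int,
            oom_score_adj: int, total_pages: int) -> Optional[int]:
    """Approximate badness in pages; None means the task is exempt."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None
    # Baseline: memory the task would likely free if killed.
    points = rss_pages + swap_pages + pgtable_pages
    # oom_score_adj shifts the score in units of 1/1000 of total memory.
    points += oom_score_adj * (total_pages // 1000)
    return points


print(badness(5000, 0, 100, 0, 1_000_000))
print(badness(5000, 0, 100, 500, 1_000_000))
```

Because the adjustment is this coarse, small errors in the memory
counters rarely change which task ends up with the highest score.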
-- 
Michal Hocko
SUSE Labs