Message-ID: <c99778c3-6ef0-48de-98ac-10913419ec90@efficios.com>
Date: Tue, 13 Jan 2026 08:51:45 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Michal Hocko <mhocko@...e.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
 "Paul E. McKenney" <paulmck@...nel.org>, Steven Rostedt
 <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>,
 Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
 Christoph Lameter <cl@...ux.com>, Martin Liu <liumartin@...gle.com>,
 David Rientjes <rientjes@...gle.com>, christian.koenig@....com,
 Shakeel Butt <shakeel.butt@...ux.dev>, SeongJae Park <sj@...nel.org>,
 Johannes Weiner <hannes@...xchg.org>,
 Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R . Howlett" <liam.howlett@...cle.com>, Mike Rapoport
 <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
 Vlastimil Babka <vbabka@...e.cz>, Christian Brauner <brauner@...nel.org>,
 Wei Yang <richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>,
 Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>,
 linux-mm@...ck.org, linux-trace-kernel@...r.kernel.org,
 Yu Zhao <yuzhao@...gle.com>, Roman Gushchin <roman.gushchin@...ux.dev>,
 Mateusz Guzik <mjguzik@...il.com>, Matthew Wilcox <willy@...radead.org>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>,
 Aboorva Devarajan <aboorvad@...ux.ibm.com>
Subject: Re: [PATCH v13 2/3] mm: Fix OOM killer inaccuracy on large many-core
 systems

On 2026-01-13 04:24, Michal Hocko wrote:
[...]
>> Would you be OK with introducing changes in the following order ?
>>
>> 1) Fix the OOM killer inaccuracy by using counter sum (iteration on all
>>     cpu counters) in task selection. This may slow down the oom killer,
>>     but would at least fix its current inaccuracy issues. This could be
>>     backported to stable kernels.
>>
>> 2) Introduce the hierarchical percpu counters on top, as an OOM killer
>>     task selection performance optimization (reduce latency of oom kill).
>>
>> This way, (2) becomes purely a performance optimization, so it's easy
>> to bisect and revert if it causes issues.
> 
> Yes, this makes more sense.
> 
>> I agree that bringing a fix along with a performance optimization within
>> a single commit makes it hard to backport to stable, and tricky to
>> revert if it causes problems.
>>
>> As for finding other users of the hpcc, I have ideas, but not so much
>> time available to try them out, as I'm pretty much doing this in my
>> spare time.
> 
> I do understand this constraint and motivation to have OOM situation
> addressed with a priority. I am pretty sure that if you see issues in
> OOM path then other consumers of get_mm_counter would be affected as
> well. Namely /proc/<pid>/stat.

Indeed /proc/<pid>/stat (implemented in fs/proc/array.c:do_task_stat())
uses get_mm_rss() which currently exports the approximated value to
userspace.
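
To illustrate why a fast read can differ from the real value, here is a
minimal single-threaded userspace sketch of a batched per-CPU counter.
It is not the kernel implementation: NR_CPUS, BATCH and all sketch_*
names are made up for the example. Each CPU keeps a local delta that is
only folded into the global count once it reaches the batch size, so a
fast read can be off by up to NR_CPUS * (BATCH - 1).

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4	/* hypothetical CPU count for the sketch */
#define BATCH   32	/* hypothetical per-CPU batch size */

struct sketch_counter {
	long global;		/* what a fast, approximate read returns */
	long local[NR_CPUS];	/* per-CPU deltas not yet folded into global */
};

/* Add to one CPU's local delta; fold it into global once it reaches BATCH. */
static void sketch_add(struct sketch_counter *c, int cpu, long v)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	c->local[cpu] += v;
	if (labs(c->local[cpu]) >= BATCH) {
		c->global += c->local[cpu];
		c->local[cpu] = 0;
	}
}

/* Fast read: O(1), ignores the unflushed per-CPU deltas. */
static long sketch_read_approx(const struct sketch_counter *c)
{
	return c->global;
}

/* Slow read: O(NR_CPUS), adds every unflushed per-CPU delta. */
static long sketch_sum_precise(const struct sketch_counter *c)
{
	long sum = c->global;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

The error bound grows with the number of CPUs, which is why the
inaccuracy matters most on large many-core systems.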

> There might be others but I can imagine
> that some of them are more performance than precision sensitive.

Agreed.

> All that being said it seems that we need slow-and-precise and
> fast-approximate interfaces to have incremental path for other users as
> well. Looking at patch 1 it seems there are interfaces available for
> that. I think it would be great to call those out explicitly in the
> highlevel doc to give some guidance what to use when with what kind of
> expectations.

I figured I'd first focus on the OOM killer's internals before tackling
the userspace ABI aspect of the problem, but since you're bringing it
up, here is what I have in mind, more or less:

- Introduce new proc files, e.g.

   /proc/<pid>/rss/approximate
   /proc/<pid>/rss/precise

Where the "approximate" file would export the following lines for each
page type (MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, MM_SHMPAGES,
allowing future additions):

<page type> <approximate> <precise_sum_min> <precise_sum_max>

And "precise" would export lines for each page type:

<page type> <precise_sum>

The key thing here is to have different files to query approximated
vs precise values, so we don't have the overhead of the precise sum
when all we need is an approximation.
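
As a rough idea of what a userspace consumer of the "approximate" file
could look like, here is a small parsing sketch. Since the ABI is
hypothetical, everything about the encoding is an assumption: the page
type printed as a single whitespace-free token (e.g. "MM_ANONPAGES"),
the values printed as decimal page counts, and the names
rss_approx_line and parse_rss_approx_line.

```c
#include <assert.h>
#include <stdio.h>

/* One line of the hypothetical /proc/<pid>/rss/approximate file. */
struct rss_approx_line {
	char type[32];		/* page type token (assumed encoding) */
	long approximate;
	long precise_sum_min;
	long precise_sum_max;
};

/*
 * Parse "<page type> <approximate> <precise_sum_min> <precise_sum_max>".
 * Returns 0 on success, -1 if a field is missing or the range is inverted.
 */
static int parse_rss_approx_line(const char *line, struct rss_approx_line *out)
{
	assert(line && out);
	if (sscanf(line, "%31s %ld %ld %ld", out->type, &out->approximate,
		   &out->precise_sum_min, &out->precise_sum_max) != 4)
		return -1;
	if (out->precise_sum_min > out->precise_sum_max)
		return -1;
	return 0;
}
```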

This would expose all the bits and pieces needed to allow userspace to
implement something similar to the 2-pass algorithm I'm proposing for
the OOM killer, but tweaked for other use-cases.
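
For the sake of discussion, a userspace two-pass selection could look
roughly like the sketch below. This is only one possible reading of the
idea, not the code from the patch series: the struct, the callback type
and the function names are all hypothetical. Pass 1 uses the cheap
[min, max] ranges to find the largest lower bound; pass 2 queries the
precise sum only for candidates whose upper bound reaches it, since any
other candidate cannot be the largest.

```c
#include <assert.h>
#include <stddef.h>

struct candidate {
	long sum_min;	/* lower bound of the precise sum */
	long sum_max;	/* upper bound of the precise sum */
};

/* Expensive exact query for candidate idx (hypothetical callback). */
typedef long (*precise_fn)(size_t idx, void *ctx);

/* Example callback for the sketch: ctx points to an array of exact values. */
static long precise_from_array(size_t idx, void *ctx)
{
	return ((const long *)ctx)[idx];
}

/*
 * Return the index of the candidate with the largest precise value, or -1
 * if n is 0. *nr_precise (optional) counts how many precise queries ran.
 */
static long pick_largest(const struct candidate *c, size_t n,
			 precise_fn precise, void *ctx, size_t *nr_precise)
{
	long best = -1, best_val = 0, best_min;
	size_t i;

	if (nr_precise)
		*nr_precise = 0;
	if (n == 0)
		return -1;
	assert(c && precise);

	/* Pass 1: largest lower bound among all candidates. */
	best_min = c[0].sum_min;
	for (i = 1; i < n; i++)
		if (c[i].sum_min > best_min)
			best_min = c[i].sum_min;

	/* Pass 2: precise query only where the range can still win. */
	for (i = 0; i < n; i++) {
		long v;

		if (c[i].sum_max < best_min)
			continue;
		v = precise(i, ctx);
		if (nr_precise)
			(*nr_precise)++;
		if (best < 0 || v > best_val) {
			best = (long)i;
			best_val = v;
		}
	}
	return best;
}
```

The number of precise queries then depends on how many ranges overlap
the top candidate, rather than on the total number of tasks.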

This proposed ABI is purely hypothetical at this stage. Please let me
know if you have something different in mind.

When you mention "highlevel doc", which document do you have in mind ?
Something related to lib/percpu_counter_tree.c or to the /proc ABI ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
