Message-ID: <c5d48b86-6b8e-4695-bbfa-a308d59eba52@efficios.com>
Date: Tue, 13 Jan 2026 17:16:16 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org, "Paul E. McKenney" <paulmck@...nel.org>,
 Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
 <mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>,
 Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>,
 Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
 christian.koenig@....com, Shakeel Butt <shakeel.butt@...ux.dev>,
 SeongJae Park <sj@...nel.org>, Michal Hocko <mhocko@...e.com>,
 Johannes Weiner <hannes@...xchg.org>,
 Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R . Howlett" <liam.howlett@...cle.com>, Mike Rapoport
 <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
 Vlastimil Babka <vbabka@...e.cz>, Christian Brauner <brauner@...nel.org>,
 Wei Yang <richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>,
 Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>,
 linux-mm@...ck.org, stable@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, Yu Zhao <yuzhao@...gle.com>,
 Roman Gushchin <roman.gushchin@...ux.dev>, Mateusz Guzik
 <mjguzik@...il.com>, Matthew Wilcox <willy@...radead.org>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>,
 Aboorva Devarajan <aboorvad@...ux.ibm.com>
Subject: Re: [PATCH v1 1/1] mm: Fix OOM killer and proc stats inaccuracy on
 large many-core systems

On 2026-01-13 16:46, Andrew Morton wrote:
> On Tue, 13 Jan 2026 14:47:34 -0500 Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
> 
>> Use the precise, albeit slower, RSS counter sums for the OOM killer
>> task selection and proc statistics. The approximated value is too
>> imprecise on large many-core systems.
> 
> Thanks.
> 
> Problem: if I also queue your "mm: Reduce latency of OOM killer task
> selection" series then this single patch won't get tested, because the
> larger series erases this patch, yes?

That's a good point.

> 
> Obvious solution: aim this patch at next-merge-window and let's look at
> the larger series for the next -rc cycle.  Thoughts?

Yes, that works for me. Does that mean I should re-submit the hpcc
series after the next merge window closes, or do you keep a queue of
material waiting for the next -rc cycle somewhere?

> 
>> The following RSS tracking issues were noted by Sweet Tea Dorminy [1],
>> which led to picking the wrong tasks as OOM kill targets:
>>
>>    Recently, several internal services had an RSS usage regression as part of a
>>    kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
>>    read RSS statistics in a backup watchdog process to monitor and decide if
>>    they'd overrun their memory budget. Now, however, a representative service
>>    with five threads, expected to use about a hundred MB of memory, on a 250-cpu
>>    machine had memory usage tens of megabytes different from the expected amount
>>    -- this constituted a significant percentage of inaccuracy, causing the
>>    watchdog to act.
>>
>>    This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
>>    into percpu_counter") [1].  Previously, the memory error was bounded by
>>    64*nr_threads pages, a very livable megabyte. Now, however, as a result of
>>    scheduler decisions moving the threads around the CPUs, the memory error could
>>    be as large as a gigabyte.
>>
>>    This is a really tremendous inaccuracy for any few-threaded program on a
>>    large machine and impedes monitoring significantly. These stat counters are
>>    also used to make OOM killing decisions, so this additional inaccuracy could
>>    make a big difference in OOM situations -- either resulting in the wrong
>>    process being killed, or in less memory being returned from an OOM-kill than
>>    expected.
>>
>> Here is a (possibly incomplete) list of the prior approaches that were
>> used or proposed, along with their downsides:
>>
>> 1) Per-thread rss tracking: large error on many-thread processes.
>>
>> 2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
>>     increased system time in make test workloads [1]. Moreover, the
>>     inaccuracy grows as O(n^2) with the number of CPUs (rough arithmetic
>>     below, after the quoted changelog).
>>
>> 3) Per-NUMA-node counters: requires atomics on the fast path (overhead);
>>     the error is high on systems with many NUMA nodes (32 times the
>>     number of NUMA nodes).
>>
>> The simple fix proposed here is to perform the precise per-CPU counter
>> sum every time a counter value needs to be read. This applies to the OOM
>> killer task selection, to the /proc statistics, and to the oom mark_victim
>> trace event.
>>
>> Note that commit 82241a83cd15 ("mm: fix the inaccurate memory statistics
>> issue for users") introduced get_mm_counter_sum() for precise proc
>> memory status queries for _some_ proc files. This change renames
>> get_mm_counter_sum() to get_mm_counter(), thus moving the rest of the
>> proc files to the precise sum.
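
As an aside, to make the O(n^2) claim in item 2) above concrete, here is
a rough back-of-the-envelope sketch (userspace arithmetic only, based on
my recollection that compute_batch_value() in lib/percpu_counter.c sizes
the fold-in batch as max(32, 2 * num_online_cpus()), and that each CPU
can keep up to batch - 1 pages not yet folded into the central count):

/*
 * Illustrative userspace arithmetic, not kernel code.  Assumes the
 * percpu_counter batch is max(32, 2 * num_online_cpus()) and that each
 * CPU can hold up to (batch - 1) unfolded pages, so the approximate
 * read can drift by up to (batch - 1) * nr_cpus pages.
 */
#include <stdio.h>

int main(void)
{
	const long page_size = 4096;	/* assume 4 KiB pages */
	const int cpus[] = { 8, 64, 250, 512 };

	for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
		int nr_cpus = cpus[i];
		long batch = 2L * nr_cpus > 32 ? 2L * nr_cpus : 32;
		long max_error_pages = (batch - 1) * nr_cpus;

		printf("%3d CPUs: batch=%4ld, worst-case drift ~%7ld pages (~%4ld MiB)\n",
		       nr_cpus, batch, max_error_pages,
		       (max_error_pages * page_size) >> 20);
	}
	return 0;
}

At 250 CPUs that is on the order of 500 MiB of possible drift for a
single counter, the same order of magnitude as the worst case described
in the quoted report above.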
> 
> Please confirm - switching /proc functions from get_mm_counter_sum() to
> get_mm_counter() doesn't actually change anything, right?  It would
> be concerning to add possible overhead to things like task_statm().

The approach proposed by this patch is to switch all proc ABIs which
query RSS to the precise sum, to eliminate any discrepancy caused by
overly imprecise approximate sums. It is a big hammer, and it can slow
down those proc interfaces, including task_statm(). Is that an issue?
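
For reference, the two reads differ roughly as follows (a simplified
sketch from memory of include/linux/mm.h after commits f1a7941243c1 and
82241a83cd15, not the exact upstream definitions):

#include <linux/mm_types.h>
#include <linux/percpu_counter.h>

/* Approximate read: only the central, already-folded value.  O(1), but
 * possibly off by up to (batch - 1) pages per CPU. */
static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
{
	return percpu_counter_read_positive(&mm->rss_stat[member]);
}

/* Precise read: sums every CPU's unfolded delta.  O(nr_cpus) and touches
 * remote cachelines, but exact at the time of the sum. */
static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
{
	return percpu_counter_sum_positive(&mm->rss_stat[member]);
}

The patch effectively makes every proc reader of RSS take the second path.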

The hpcc series introduces an approximation with accuracy bounds that
keep the result meaningful on large many-core systems.

The overall approach here would be to move the proc interfaces that
care about low overhead back to the hpcc approximate sum once it lands
upstream. But to do that, we need to know which proc interface files
are performance-sensitive. How can we get that data?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
