linux-kernel - Re: [RFC PATCH v2] Introduce Hierarchical Per-CPU Counters

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f014a9bb-0653-4682-8608-7fe6e2ad5ee6@efficios.com>
Date: Tue, 8 Apr 2025 15:29:40 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: "Christoph Lameter (Ampere)" <cl@...two.org>
Cc: Sweet Tea Dorminy <sweettea@...gle.com>, Mateusz Guzik
 <mjguzik@...il.com>, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>,
 "Paul E. McKenney" <paulmck@...nel.org>, Steven Rostedt
 <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>,
 Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
 Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
 christian.koenig@....com, Shakeel Butt <shakeel.butt@...ux.dev>,
 Johannes Weiner <hannes@...xchg.org>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Suren Baghdasaryan <surenb@...gle.com>, Vlastimil Babka <vbabka@...e.cz>,
 Christian Brauner <brauner@...nel.org>, Wei Yang
 <richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>,
 Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>,
 linux-mm@...ck.org, linux-trace-kernel@...r.kernel.org,
 Yu Zhao <yuzhao@...gle.com>, Roman Gushchin <roman.gushchin@...ux.dev>,
 Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC PATCH v2] Introduce Hierarchical Per-CPU Counters

On 2025-04-08 12:37, Christoph Lameter (Ampere) wrote:
> On Tue, 8 Apr 2025, Mathieu Desnoyers wrote:
> 
>> - Minimize contention when incrementing and decrementing counters,
>> - Provide fast access to a sum approximation,
> 
> In general I like this as a abstraction of the Zoned VM counters in
> vmstat.c that will make the scalable counters there useful elsewhere.

I'm glad you like it! :)

> 
>> It aims at fixing the per-mm RSS tracking which has become too
>> inaccurate for OOM killer purposes on large many-core systems [1].
> 
> There are numerous cases where these issues occur. I know of a few I could
> use something like this.

I'm looking forward to see it used.

> 
>> The hierarchical per-CPU counters propagate a sum approximation through
>> a binary tree. When reaching the batch size, the carry is propagated
>> through a binary tree which consists of log2(nr_cpu_ids) levels. The
>> batch size for each level is twice the batch size of the prior level.
> 
> A binary tree? Could we do this N-way? Otherwise the tree will be 8 levels
> on a 512 cpu machine. Given the inflation of the number of cpus this
> scheme better work up to 8K cpus.

That's a good point. We should probably tune the tree n-arity to limit
the tree depth to about 5. Here is a table I've done showing a possible
n-arity choice for each power of 2 nr_cpu_ids:

nr_cpu_ids that can be represented at depth N for tree n-arity
2, 4, 8, and 16:

    N      2^N        4^N             8^N                 16^N
--------------------------------------------------------------
    0        1          1               1                    1
    1        2          4               8                   16
    2        4         16              64                  256
    3        8         64             512                 4096
    4       16        256            4096               *65536*
    5       32       1024          *32768*             1048576
    6       64       4096          262144             16777216
    7      128     *16384*        2097152            268435456
    8      256      65536        16777216           4294967296
    9      512     262144       134217728          68719476736
   10     1024    1048576      1073741824        1099511627776
   11     2048    4194304      8589934592       17592186044416
   12     4096   16777216     68719476736      281474976710656
   13    *8192*  67108864    549755813888     4503599627370496
   14    16384  268435456   4398046511104    72057594037927936
   15    32768 1073741824  35184372088832  1152921504606846976
   16    65536 4294967296 281474976710656 18446744073709551616

nr_levels is the number of tree levels for carry propagation
(excludes the final accumulation counter).

nr_cpu_ids   n-arity          nr_levels
       2         2                 1
       4         2                 2
       8         2                 3
      16         4                 3
      32         4                 3
      64         4                 3
     128         8                 3
     256         8                 3
     512         8                 3
    1024         8                 4
    2048         8                 4
    4096         8                 4
    8192        16                 4
   16384        16                 4
   32768        16                 4
   65536        16                 4
  131072        16                 5
  262144        16                 5
  524288        16                 5
1048576        16                 5

> 
>> +int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter);
>> +int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b);
>> +int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v);
> 
> Precise? Concurrent counter updates can occur while determining the global
> value. People may get confused.

If counters are incremented or decremented concurrently, of course a "sum"
reader will either observe the value before or after the increment.
This is inherent to reading a counter concurrently with its updates.
However, the same can be said of reading a global counter concurrently
with its updates.

If you introduce an external synchronization mechanism that makes the
updaters quiescent (e.g. RCU) before reading the counter value, then
you have a sum which is guaranteed to include all prior increments/decrements.

So using the term "precise" does not appear misleading to me from that
perspective.

> Also maybe there would be a need for a function to collape the values into
> the global if f.e. a cpu goes off line or in order to switch off OS
> activities on a cpu.

Currently percpu_counter_tree_precise_sum_unbiased() iterates on each
possible cpu, which does not require cpu hotplug integration.

Indeed, as an improvement, we could integrate with cpu hotplug notifiers
and move the offlining CPU counters to a global accumulator, but this would
require the precise sum to synchronize against concurrent CPU hotplug. That
would allow the precise sum iteration to only consider online cpus.

Note that the hierarchical counter tree algorithms for both the
increment/decrement and for approximate/precise sums are conveniently
lock-free at the moment.

We'd have to consider whether it's preferable to keep lock-freedom or
introduce locking for the precise sum to iterate on fewer CPUs.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com