Message-ID: <a3652f6b-cd3b-45ad-a3ad-e92369cf4068@efficios.com>
Date: Sat, 22 Nov 2025 12:15:02 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org, "Paul E. McKenney" <paulmck@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
<mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>,
Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
christian.koenig@....com, Shakeel Butt <shakeel.butt@...ux.dev>,
SeongJae Park <sj@...nel.org>, Michal Hocko <mhocko@...e.com>,
Johannes Weiner <hannes@...xchg.org>,
Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R . Howlett" <liam.howlett@...cle.com>, Mike Rapoport
<rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>, Christian Brauner <brauner@...nel.org>,
Wei Yang <richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>,
Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>,
linux-mm@...ck.org, linux-trace-kernel@...r.kernel.org,
Yu Zhao <yuzhao@...gle.com>, Roman Gushchin <roman.gushchin@...ux.dev>,
Mateusz Guzik <mjguzik@...il.com>, Matthew Wilcox <willy@...radead.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
Aboorva Devarajan <aboorvad@...ux.ibm.com>
Subject: Re: [PATCH v9 1/2] lib: Introduce hierarchical per-cpu counters
On 2025-11-21 13:03, Andrew Morton wrote:
> On Thu, 20 Nov 2025 16:03:53 -0500 Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
>> ...
>> It aims at fixing the per-mm RSS tracking which has become too
>> inaccurate for OOM killer purposes on large many-core systems [1].
>
> Presentation nit: the info at [1] is rather well hidden until one reads
> the [2/2] changelog. You might want to move that material into the
> [0/N] - after all, it's the entire point of the patchset.
Will do.
>> The hierarchical per-CPU counters propagate a sum approximation through
an N-way tree. When reaching the batch size, the carry is propagated
>> through a binary tree which consists of logN(nr_cpu_ids) levels. The
>> batch size for each level is twice the batch size of the prior level.
>> ...
> Looks very neat.
I'm glad you like the approach :) When I looked at this problem, a
hierarchical tree seemed like a good candidate for solving the
general case.
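To illustrate the scheme, here is a minimal single-threaded userspace
sketch of the carry propagation (made-up names, collapsed to a single
path rather than a per-cpu N-way tree, and limited to non-negative
increments; the real implementation uses per-cpu data and atomics):

#include <stdio.h>

#define NR_LEVELS	4
#define BASE_BATCH	8	/* level 0 batch size, power of 2 */

static long level_count[NR_LEVELS];	/* intermediate counters */
static long approx_sum;		/* final level, read by observers */

/*
 * Add @inc at @level: keep the remainder modulo the level's batch
 * size locally and propagate whole batches (the carry) to the next
 * level. The batch size doubles at each level.
 */
static void tree_add(int level, long inc)
{
	long batch, sum, carry;

	if (level == NR_LEVELS) {
		approx_sum += inc;
		return;
	}
	batch = (long)BASE_BATCH << level;
	sum = level_count[level] + inc;
	carry = sum & ~(batch - 1);	/* whole batches crossed */
	level_count[level] = sum & (batch - 1);
	if (carry)
		tree_add(level + 1, carry);
}

int main(void)
{
	long precise = 0;

	for (int i = 0; i < 100; i++) {
		tree_add(0, 1);
		precise++;
	}
	/*
	 * The approximation lags the precise value by a bounded
	 * amount: at most the sum over all levels of (batch size - 1).
	 */
	printf("approx=%ld precise=%ld\n", approx_sum, precise);
	return 0;
}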
>
> Have you identified other parts of the kernel which could use this?
I have not surveyed other possible users yet. At first glance, users
of percpu_counter_read_positive() from linux/percpu_counter.h may be
good candidates for the hierarchical tree. Here is a list:
lib/flex_proportions.c
157: num = percpu_counter_read_positive(&pl->events);
158: den = percpu_counter_read_positive(&p->events);
fs/file_table.c
88: return percpu_counter_read_positive(&nr_files);
fs/ext4/balloc.c
627: free_clusters = percpu_counter_read_positive(fcc);
628: dirty_clusters = percpu_counter_read_positive(dcc);
fs/ext4/ialloc.c
448: freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
450: freec = percpu_counter_read_positive(&sbi->s_freeclusters_counter);
453: ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
fs/ext4/extents_status.c
1682: nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
1694: ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
1699: ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
fs/ext4/inode.c
3091: percpu_counter_read_positive(&sbi->s_freeclusters_counter);
3093: percpu_counter_read_positive(&sbi->s_dirtyclusters_counter);
fs/btrfs/space-info.c
1028: ordered = percpu_counter_read_positive(&fs_info->ordered_bytes) >> 1;
1029: delalloc = percpu_counter_read_positive(&fs_info->delalloc_bytes);
fs/quota/dquot.c
808: percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]));
fs/ext2/balloc.c
1162: free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
fs/ext2/ialloc.c
268: freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
270: free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
272: ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
fs/jbd2/journal.c
1263: count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
1268: count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
1287: count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
fs/xfs/xfs_ioctl.c
1149: .allocino = percpu_counter_read_positive(&mp->m_icount),
1150: .freeino = percpu_counter_read_positive(&mp->m_ifree),
fs/xfs/xfs_mount.h
736: return percpu_counter_read_positive(&mp->m_free[ctr].count);
fs/xfs/libxfs/xfs_ialloc.c
732: percpu_counter_read_positive(&args.mp->m_icount) + newlen >
1913: * Read rough value of mp->m_icount by percpu_counter_read_positive,
1917: percpu_counter_read_positive(&mp->m_icount) + igeo->ialloc_inos
mm/util.c
965: if (percpu_counter_read_positive(&vm_committed_as) < allowed)
drivers/md/dm-crypt.c
1734: if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) +
2750: * Note, percpu_counter_read_positive() may over (and under) estimate
2754: if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) >= dm_crypt_pages_per_client) &&
io_uring/io_uring.c
2502: return percpu_counter_read_positive(&tctx->inflight);
include/linux/backing-dev.h
71: return percpu_counter_read_positive(&wb->stat[item]);
include/net/sock.h
1435: return percpu_counter_read_positive(sk->sk_prot->sockets_allocated);
include/net/dst_ops.h
48: return percpu_counter_read_positive(&dst->pcpuc_entries);
There are probably other use-cases lurking out there. Anything that
needs to sort resources taken by objects used concurrently from many
CPUs in order to make decisions is a good candidate.

The fact that the maximum inaccuracy is bounded means that we can
perform approximate comparisons in a fast path (which is enough when
the values being compared differ widely), and only fall back to the
more expensive "precise" comparison when the values are close to each
other and we actually care about their relative order (sketched
below).
I'm thinking of memory usage tracking, runtime usage (scheduler?),
and cgroup resource accounting.
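In code, such a comparison could look like this sketch (I'm guessing
at the name of the precise accessor here):

#include <linux/percpu_counter_tree.h>

/*
 * Order two counters, using the cheap approximation when the values
 * are far apart. @epsilon is the documented maximum inaccuracy of a
 * single approximate read.
 */
static bool counter_gt(struct percpu_counter_tree *a,
		       struct percpu_counter_tree *b,
		       int epsilon)
{
	int va = percpu_counter_tree_approximate_sum(a);
	int vb = percpu_counter_tree_approximate_sum(b);

	/*
	 * Each read is within +/- epsilon of its precise value, so a
	 * gap larger than 2 * epsilon already settles the order.
	 */
	if (va - vb > 2 * epsilon)
		return true;
	if (vb - va > 2 * epsilon)
		return false;
	/* Close call: pay for the precise sums. */
	return percpu_counter_tree_precise_sum(a) >
	       percpu_counter_tree_precise_sum(b);
}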
>
>> include/linux/percpu_counter_tree.h | 239 +++++++++++++++
>> init/main.c | 2 +
>> lib/Makefile | 1 +
>> lib/percpu_counter_tree.c | 443 ++++++++++++++++++++++++++++
>> 4 files changed, 685 insertions(+)
>> create mode 100644 include/linux/percpu_counter_tree.h
>> create mode 100644 lib/percpu_counter_tree.c
>
> An in-kernel test suite would be great. Like lib/*test*.c or
> tools/testing/.
I'll keep a note to port the tests from my userspace librseq percpu
counters feature branch to the kernel. I did not do it initially
because I wanted to see whether the overall approach would be deemed
interesting for the kernel first.
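For reference, the rough shape it would take (KUnit; the
init/destroy/precise_sum signatures below are illustrative and may
not match the final API):

#include <kunit/test.h>
#include <linux/percpu_counter_tree.h>

static void test_add_precise(struct kunit *test)
{
	struct percpu_counter_tree counter;

	/* batch_size = 32, must be a power of 2 */
	KUNIT_ASSERT_EQ(test, percpu_counter_tree_init(&counter, 32), 0);
	for (int i = 0; i < 1000; i++)
		percpu_counter_tree_add(&counter, 1);
	/* The precise sum must be exact regardless of batching. */
	KUNIT_EXPECT_EQ(test,
			percpu_counter_tree_precise_sum(&counter), 1000);
	percpu_counter_tree_destroy(&counter);
}

static struct kunit_case percpu_counter_tree_cases[] = {
	KUNIT_CASE(test_add_precise),
	{}
};

static struct kunit_suite percpu_counter_tree_suite = {
	.name = "percpu_counter_tree",
	.test_cases = percpu_counter_tree_cases,
};
kunit_test_suite(percpu_counter_tree_suite);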
>> +struct percpu_counter_tree {
>>...
>> +};
>
> I find that understanding the data structure leads to understanding the
> code, so additional documentation for the various fields would be
> helpful.
Will do.
>
>> +
>> +static inline
>> +int percpu_counter_tree_carry(int orig, int res, int inc, unsigned int bit_mask)
>> +{
>> ...
>> +}
>> +
>> +static inline
>> +void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
>> +{
>> ...
>> +}
(x86-64 disassembly below, obtained by expanding the inline functions
into real functions)
00000000000006a0 <disasm_percpu_counter_tree_add>:
[...]
6a4: 8b 4f 08 mov 0x8(%rdi),%ecx
6a7: 48 8b 17 mov (%rdi),%rdx
6aa: 89 f0 mov %esi,%eax
6ac: 65 0f c1 02 xadd %eax,%gs:(%rdx)
6b0: 41 89 c8 mov %ecx,%r8d
6b3: 8d 14 06 lea (%rsi,%rax,1),%edx
6b6: 41 f7 d8 neg %r8d
6b9: 85 f6 test %esi,%esi
6bb: 78 18 js 6d5 <disasm_percpu_counter_tree_add+0x35>
6bd: 44 21 c6 and %r8d,%esi
6c0: 31 f0 xor %esi,%eax
6c2: 31 d0 xor %edx,%eax
6c4: 8d 14 31 lea (%rcx,%rsi,1),%edx
6c7: 85 c8 test %ecx,%eax
6c9: 0f 45 f2 cmovne %edx,%esi
6cc: 85 f6 test %esi,%esi
6ce: 75 1d jne 6ed <disasm_percpu_counter_tree_add+0x4d>
6d0: e9 00 00 00 00 jmp 6d5 <disasm_percpu_counter_tree_add+0x35>
6d5: f7 de neg %esi
6d7: 44 21 c6 and %r8d,%esi
6da: f7 de neg %esi
6dc: 31 f0 xor %esi,%eax
6de: 31 d0 xor %edx,%eax
6e0: 89 f2 mov %esi,%edx
6e2: 29 ca sub %ecx,%edx
6e4: 85 c8 test %ecx,%eax
6e6: 0f 45 f2 cmovne %edx,%esi
6e9: 85 f6 test %esi,%esi
6eb: 74 e3 je 6d0 <disasm_percpu_counter_tree_add+0x30>
6ed: e9 ae fb ff ff jmp 2a0 <percpu_counter_tree_add_slowpath>
>> +
>> +static inline
>> +int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
>> +{
>> ...
>> +}
0000000000000710 <disasm_percpu_counter_tree_approximate_sum>:
[...]
714: 48 8b 47 10 mov 0x10(%rdi),%rax
718: 8b 10 mov (%rax),%edx
71a: 8b 47 18 mov 0x18(%rdi),%eax
71d: 01 d0 add %edx,%eax
71f: e9 00 00 00 00 jmp 724 <disasm_percpu_counter_tree_approximate_sum+0x14>
>
> These are pretty large after all the nested inlining is expanded. Are
> you sure that inlining them is the correct call?
percpu_counter_tree_add: 30 insns, 79 bytes
percpu_counter_tree_approximate_sum: 5 insns, 16 bytes

So I agree that inlining "add" may not be the right call, given the
relatively large footprint of the bitwise operations used to compute
the carry. Moreover, it contains a xadd, which takes ~9 cycles on
recent CPUs, so the added overhead of a function call should not be
significant in comparison.

Inlining "approximate_sum" does seem to be the right call: the
function is really small (16 bytes), and it only contains loads and
an add, which are faster than a function call.
>
>
>> +#else /* !CONFIG_SMP */
>> +
>>
>> ...
>>
>> +#include <linux/percpu_counter_tree.h>
>> +#include <linux/cpumask.h>
>> +#include <linux/percpu.h>
>> +#include <linux/atomic.h>
>> +#include <linux/errno.h>
>> +#include <linux/slab.h>
>> +#include <linux/math.h>
>> +
>> +#define MAX_NR_LEVELS 5
>> +
>> +struct counter_config {
>> + unsigned int nr_items;
>> + unsigned char nr_levels;
>> + unsigned char n_arity_order[MAX_NR_LEVELS];
>> +};
>> +
>> +/*
>> + * nr_items is the number of items in the tree for levels 1 to and
>> + * including the final level (approximate sum). It excludes the level 0
>> + * per-cpu counters.
>> + */
>
> That's referring to counter_config.nr_items? Comment appears to be
> misplaced.
Good point, I'll move it beside the "nr_items" field in
counter_config. I'll also document the other fields.
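i.e. something like:

struct counter_config {
	/*
	 * Number of items in the tree for levels 1 to and including
	 * the final level (approximate sum). It excludes the level 0
	 * per-cpu counters.
	 */
	unsigned int nr_items;
	unsigned char nr_levels;
	unsigned char n_arity_order[MAX_NR_LEVELS];
};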
>>...
>> + /* Batch size must be power of 2 */
>> + if (!batch_size || (batch_size & (batch_size - 1)))
>> + return -EINVAL;
>
> It's a bug, yes? Worth a WARN?
I'll add a WARN_ON().
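e.g., using is_power_of_2() from <linux/log2.h>, which also rejects a
zero batch size:

	/* Batch size must be a power of 2; anything else is a caller bug. */
	if (WARN_ON(!is_power_of_2(batch_size)))
		return -EINVAL;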
>>...
> It would be nice to kerneldocify the exported API.
OK
>
> Some fairly detailed words explaining the pros and cons of precise vs
> approximate would be helpful to people who are using this API.
Good point.
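Something along these lines on the read side, for instance (wording
to be refined):

/**
 * percpu_counter_tree_approximate_sum - Read an approximation of the sum.
 * @counter: Counter tree to read.
 *
 * Very cheap (a couple of loads, no iteration over CPUs), but the
 * result may be off by a bounded amount which depends on the batch
 * size and the depth of the tree. Prefer it when the values being
 * compared are expected to differ by more than that bound; fall back
 * to the precise sum otherwise.
 */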
>> +int __init percpu_counter_tree_subsystem_init(void)
>
> I'm not sure that the "subsystem_" adds any value.
The symbol "percpu_counter_tree_init()" is already taken for the
initialization of individual counter trees, so I added "subsystem"
here to distinguish between the two symbols. Looking at kernel bootup
functions, "subsystem_init" is already used at least once:
1108: acpi_subsystem_init();
>
>> +{
>> +
>> + nr_cpus_order = get_count_order(nr_cpu_ids);
>
> Stray newline.
OK
Thanks for the review!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com