Message-ID: <a3652f6b-cd3b-45ad-a3ad-e92369cf4068@efficios.com>
Date: Sat, 22 Nov 2025 12:15:02 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org, "Paul E. McKenney" <paulmck@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
<mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>,
Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
christian.koenig@....com, Shakeel Butt <shakeel.butt@...ux.dev>,
SeongJae Park <sj@...nel.org>, Michal Hocko <mhocko@...e.com>,
Johannes Weiner <hannes@...xchg.org>,
Sweet Tea Dorminy <sweettea-kernel@...miny.me>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R . Howlett" <liam.howlett@...cle.com>, Mike Rapoport
<rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>, Christian Brauner <brauner@...nel.org>,
Wei Yang <richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>,
Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>,
linux-mm@...ck.org, linux-trace-kernel@...r.kernel.org,
Yu Zhao <yuzhao@...gle.com>, Roman Gushchin <roman.gushchin@...ux.dev>,
Mateusz Guzik <mjguzik@...il.com>, Matthew Wilcox <willy@...radead.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
Aboorva Devarajan <aboorvad@...ux.ibm.com>
Subject: Re: [PATCH v9 1/2] lib: Introduce hierarchical per-cpu counters
On 2025-11-21 13:03, Andrew Morton wrote:
> On Thu, 20 Nov 2025 16:03:53 -0500 Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
>> ...
>> It aims at fixing the per-mm RSS tracking which has become too
>> inaccurate for OOM killer purposes on large many-core systems [1].
>
> Presentation nit: the info at [1] is rather well hidden until one reads
> the [2/2] changelog. You might want to move that material into the
> [0/N] - after all, it's the entire point of the patchset.
Will do.
>> The hierarchical per-CPU counters propagate a sum approximation through
an N-way tree. When reaching the batch size, the carry is propagated
>> through a binary tree which consists of logN(nr_cpu_ids) levels. The
>> batch size for each level is twice the batch size of the prior level.
>> ...
> Looks very neat.
I'm glad you like the approach :) When I looked at this problem, a
hierarchical tree seemed like a good candidate for solving the
general case.
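To illustrate the scheme, here is a minimal single-threaded userspace
sketch of the carry propagation (made-up names, collapsed to a single
path rather than a per-cpu N-way tree, and limited to non-negative
increments; the real implementation uses per-cpu data and atomics):

#include <stdio.h>

#define NR_LEVELS	4
#define BASE_BATCH	8	/* level 0 batch size, power of 2 */

static long level_count[NR_LEVELS];	/* intermediate counters */
static long approx_sum;		/* final level, read by observers */

/*
 * Add @inc at @level: keep the remainder modulo the level's batch
 * size locally and propagate whole batches (the carry) to the next
 * level. The batch size doubles at each level.
 */
static void tree_add(int level, long inc)
{
	long batch, sum, carry;

	if (level == NR_LEVELS) {
		approx_sum += inc;
		return;
	}
	batch = (long)BASE_BATCH << level;
	sum = level_count[level] + inc;
	carry = sum & ~(batch - 1);	/* whole batches crossed */
	level_count[level] = sum & (batch - 1);
	if (carry)
		tree_add(level + 1, carry);
}

int main(void)
{
	long precise = 0;

	for (int i = 0; i < 100; i++) {
		tree_add(0, 1);
		precise++;
	}
	/*
	 * The approximation lags the precise value by a bounded
	 * amount: at most the sum over all levels of (batch size - 1).
	 */
	printf("approx=%ld precise=%ld\n", approx_sum, precise);
	return 0;
}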
>
> Have you identified other parts of the kernel which could use this?
I have not surveyed other possible users yet. At first glance, users
of percpu_counter_read_positive() from linux/percpu_counter.h may be
good candidates for the hierarchical tree. Here is a list:
lib/flex_proportions.c
157: num = percpu_counter_read_positive(&pl->events);
158: den = percpu_counter_read_positive(&p->events);
fs/file_table.c
88: return percpu_counter_read_positive(&nr_files);
fs/ext4/balloc.c
627: free_clusters = percpu_counter_read_positive(fcc);
628: dirty_clusters = percpu_counter_read_positive(dcc);
fs/ext4/ialloc.c
448: freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
450: freec = percpu_counter_read_positive(&sbi->s_freeclusters_counter);
453: ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
fs/ext4/extents_status.c
1682: nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
1694: ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
1699: ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
fs/ext4/inode.c
3091: percpu_counter_read_positive(&sbi->s_freeclusters_counter);
3093: percpu_counter_read_positive(&sbi->s_dirtyclusters_counter);
fs/btrfs/space-info.c
1028: ordered = percpu_counter_read_positive(&fs_info->ordered_bytes) >> 1;
1029: delalloc = percpu_counter_read_positive(&fs_info->delalloc_bytes);
fs/quota/dquot.c
808: percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS]));
fs/ext2/balloc.c
1162: free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
fs/ext2/ialloc.c
268: freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
270: free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
272: ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
fs/jbd2/journal.c
1263: count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
1268: count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
1287: count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
fs/xfs/xfs_ioctl.c
1149: .allocino = percpu_counter_read_positive(&mp->m_icount),
1150: .freeino = percpu_counter_read_positive(&mp->m_ifree),
fs/xfs/xfs_mount.h
736: return percpu_counter_read_positive(&mp->m_free[ctr].count);
fs/xfs/libxfs/xfs_ialloc.c
732: percpu_counter_read_positive(&args.mp->m_icount) + newlen >
1913: * Read rough value of mp->m_icount by percpu_counter_read_positive,
1917: percpu_counter_read_positive(&mp->m_icount) + igeo->ialloc_inos
mm/util.c
965: if (percpu_counter_read_positive(&vm_committed_as) < allowed)
drivers/md/dm-crypt.c
1734: if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) +
2750: * Note, percpu_counter_read_positive() may over (and under) estimate
2754: if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) >= dm_crypt_pages_per_client) &&
io_uring/io_uring.c
2502: return percpu_counter_read_positive(&tctx->inflight);
include/linux/backing-dev.h
71: return percpu_counter_read_positive(&wb->stat[item]);
include/net/sock.h
1435: return percpu_counter_read_positive(sk->sk_prot->sockets_allocated);
include/net/dst_ops.h
48: return percpu_counter_read_positive(&dst->pcpuc_entries);
There are probably other use-cases lurking out there. Anything that
needs to sort resources taken by objects used concurrently from many
CPUs in order to make decisions is a good candidate.

The fact that the maximum inaccuracy is bounded means that we can
perform approximate comparisons in a fast path (which is enough when
the values being compared differ widely), and only fall back to the
more expensive "precise" comparison when the values are close to each
other and we actually care about their relative order (sketched
below).
I'm thinking of memory usage tracking, runtime usage (scheduler?),
and cgroup resource accounting.
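In code, such a comparison could look like this sketch (I'm guessing
at the name of the precise accessor here):

#include <linux/percpu_counter_tree.h>

/*
 * Order two counters, using the cheap approximation when the values
 * are far apart. @epsilon is the documented maximum inaccuracy of a
 * single approximate read.
 */
static bool counter_gt(struct percpu_counter_tree *a,
		       struct percpu_counter_tree *b,
		       int epsilon)
{
	int va = percpu_counter_tree_approximate_sum(a);
	int vb = percpu_counter_tree_approximate_sum(b);

	/*
	 * Each read is within +/- epsilon of its precise value, so a
	 * gap larger than 2 * epsilon already settles the order.
	 */
	if (va - vb > 2 * epsilon)
		return true;
	if (vb - va > 2 * epsilon)
		return false;
	/* Close call: pay for the precise sums. */
	return percpu_counter_tree_precise_sum(a) >
	       percpu_counter_tree_precise_sum(b);
}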
>
>> include/linux/percpu_counter_tree.h | 239 +++++++++++++++
>> init/main.c | 2 +
>> lib/Makefile | 1 +
>> lib/percpu_counter_tree.c | 443 ++++++++++++++++++++++++++++
>> 4 files changed, 685 insertions(+)
>> create mode 100644 include/linux/percpu_counter_tree.h
>> create mode 100644 lib/percpu_counter_tree.c
>
> An in-kernel test suite would be great. Like lib/*test*.c or
> tools/testing/.
I'll keep a note to port the tests from my userspace librseq percpu
counters feature branch to the kernel. I did not do it initially
because I wanted to see whether the overall approach would be deemed
interesting for the kernel first.
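For reference, the rough shape it would take (KUnit; the
init/destroy/precise_sum signatures below are illustrative and may
not match the final API):

#include <kunit/test.h>
#include <linux/percpu_counter_tree.h>

static void test_add_precise(struct kunit *test)
{
	struct percpu_counter_tree counter;

	/* batch_size = 32, must be a power of 2 */
	KUNIT_ASSERT_EQ(test, percpu_counter_tree_init(&counter, 32), 0);
	for (int i = 0; i < 1000; i++)
		percpu_counter_tree_add(&counter, 1);
	/* The precise sum must be exact regardless of batching. */
	KUNIT_EXPECT_EQ(test,
			percpu_counter_tree_precise_sum(&counter), 1000);
	percpu_counter_tree_destroy(&counter);
}

static struct kunit_case percpu_counter_tree_cases[] = {
	KUNIT_CASE(test_add_precise),
	{}
};

static struct kunit_suite percpu_counter_tree_suite = {
	.name = "percpu_counter_tree",
	.test_cases = percpu_counter_tree_cases,
};
kunit_test_suite(percpu_counter_tree_suite);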
>> +struct percpu_counter_tree {
>>...
>> +};
>
> I find that understanding the data structure leads to understanding the
> code, so additional documentation for the various fields would be
> helpful.
Will do.
>
>> +
>> +static inline
>> +int percpu_counter_tree_carry(int orig, int res, int inc, unsigned int bit_mask)
>> +{
>> ...
>> +}
>> +
>> +static inline
>> +void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
>> +{
>> ...
>> +}
(x86-64 disassembly below, obtained by expanding the inline functions
into real functions)
00000000000006a0 <disasm_percpu_counter_tree_add>:
[...]
6a4: 8b 4f 08 mov 0x8(%rdi),%ecx
6a7: 48 8b 17 mov (%rdi),%rdx
6aa: 89 f0 mov %esi,%eax
6ac: 65 0f c1 02 xadd %eax,%gs:(%rdx)
6b0: 41 89 c8 mov %ecx,%r8d
6b3: 8d 14 06 lea (%rsi,%rax,1),%edx
6b6: 41 f7 d8 neg %r8d
6b9: 85 f6 test %esi,%esi
6bb: 78 18 js 6d5 <disasm_percpu_counter_tree_add+0x35>
6bd: 44 21 c6 and %r8d,%esi
6c0: 31 f0 xor %esi,%eax
6c2: 31 d0 xor %edx,%eax
6c4: 8d 14 31 lea (%rcx,%rsi,1),%edx
6c7: 85 c8 test %ecx,%eax
6c9: 0f 45 f2 cmovne %edx,%esi
6cc: 85 f6 test %esi,%esi
6ce: 75 1d jne 6ed <disasm_percpu_counter_tree_add+0x4d>
6d0: e9 00 00 00 00 jmp 6d5 <disasm_percpu_counter_tree_add+0x35>
6d5: f7 de neg %esi
6d7: 44 21 c6 and %r8d,%esi
6da: f7 de neg %esi
6dc: 31 f0 xor %esi,%eax
6de: 31 d0 xor %edx,%eax
6e0: 89 f2 mov %esi,%edx
6e2: 29 ca sub %ecx,%edx
6e4: 85 c8 test %ecx,%eax
6e6: 0f 45 f2 cmovne %edx,%esi
6e9: 85 f6 test %esi,%esi
6eb: 74 e3 je 6d0 <disasm_percpu_counter_tree_add+0x30>
6ed: e9 ae fb ff ff jmp 2a0 <percpu_counter_tree_add_slowpath>
>> +
>> +static inline
>> +int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter)
>> +{
>> ...
>> +}
0000000000000710 <disasm_percpu_counter_tree_approximate_sum>:
[...]
714: 48 8b 47 10 mov 0x10(%rdi),%rax
718: 8b 10 mov (%rax),%edx
71a: 8b 47 18 mov 0x18(%rdi),%eax
71d: 01 d0 add %edx,%eax
71f: e9 00 00 00 00 jmp 724 <disasm_percpu_counter_tree_approximate_sum+0x14>
>
> These are pretty large after all the nested inlining is expanded. Are
> you sure that inlining them is the correct call?
percpu_counter_tree_add: 30 insns, 79 bytes
percpu_counter_tree_approximate_sum: 5 insns, 16 bytes

So I agree that inlining "add" may not be the right call, given the
relatively large footprint of the bitwise operations used to compute
the carry. Moreover, it contains a xadd, which takes ~9 cycles on
recent CPUs, so the added overhead of a function call should not be
significant in comparison.

Inlining "approximate_sum" does seem to be the right call: the
function is really small (16 bytes), and it only contains loads and
an add, which are faster than a function call.
>
>
>> +#else /* !CONFIG_SMP */
>> +
>>
>> ...
>>
>> +#include <linux/percpu_counter_tree.h>
>> +#include <linux/cpumask.h>
>> +#include <linux/percpu.h>
>> +#include <linux/atomic.h>
>> +#include <linux/errno.h>
>> +#include <linux/slab.h>
>> +#include <linux/math.h>
>> +
>> +#define MAX_NR_LEVELS 5
>> +
>> +struct counter_config {
>> + unsigned int nr_items;
>> + unsigned char nr_levels;
>> + unsigned char n_arity_order[MAX_NR_LEVELS];
>> +};
>> +
>> +/*
>> + * nr_items is the number of items in the tree for levels 1 to and
>> + * including the final level (approximate sum). It excludes the level 0
>> + * per-cpu counters.
>> + */
>
> That's referring to counter_config.nr_items? Comment appears to be
> misplaced.
Good point, I'll move it beside the "nr_items" field in
counter_config. I'll also document the other fields.
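i.e. something like:

struct counter_config {
	/*
	 * Number of items in the tree for levels 1 to and including
	 * the final level (approximate sum). It excludes the level 0
	 * per-cpu counters.
	 */
	unsigned int nr_items;
	unsigned char nr_levels;
	unsigned char n_arity_order[MAX_NR_LEVELS];
};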
>>...
>> + /* Batch size must be power of 2 */
>> + if (!batch_size || (batch_size & (batch_size - 1)))
>> + return -EINVAL;
>
> It's a bug, yes? Worth a WARN?
I'll add a WARN_ON().
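e.g., using is_power_of_2() from <linux/log2.h>, which also rejects a
zero batch size:

	/* Batch size must be a power of 2; anything else is a caller bug. */
	if (WARN_ON(!is_power_of_2(batch_size)))
		return -EINVAL;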
>>...
> It would be nice to kerneldocify the exported API.
OK
>
> Some fairly detailed words explaining the pros and cons of precise vs
> approximate would be helpful to people who are using this API.
Good point.
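Something along these lines on the read side, for instance (wording
to be refined):

/**
 * percpu_counter_tree_approximate_sum - Read an approximation of the sum.
 * @counter: Counter tree to read.
 *
 * Very cheap (a couple of loads, no iteration over CPUs), but the
 * result may be off by a bounded amount which depends on the batch
 * size and the depth of the tree. Prefer it when the values being
 * compared are expected to differ by more than that bound; fall back
 * to the precise sum otherwise.
 */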
>> +int __init percpu_counter_tree_subsystem_init(void)
>
> I'm not sure that the "subsystem_" adds any value.
The symbol "percpu_counter_tree_init()" is already taken for the
initialization of individual counter trees, so I added "subsystem"
here to distinguish between the two symbols. Looking at kernel bootup
functions, "subsystem_init" is already used at least once:
1108: acpi_subsystem_init();
>
>> +{
>> +
>> + nr_cpus_order = get_count_order(nr_cpu_ids);
>
> Stray newline.
OK
Thanks for the review!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com