[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bcebaacb-ef9e-409d-b770-3057a96c3d11@efficios.com>
Date: Tue, 8 Apr 2025 15:40:19 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Matthew Wilcox <willy@...radead.org>,
"Christoph Lameter (Ampere)" <cl@...two.org>,
Sweet Tea Dorminy <sweettea@...gle.com>, Mateusz Guzik <mjguzik@...il.com>,
linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
"Paul E. McKenney" <paulmck@...nel.org>, Steven Rostedt
<rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>,
Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
christian.koenig@....com, Shakeel Butt <shakeel.butt@...ux.dev>,
Johannes Weiner <hannes@...xchg.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Suren Baghdasaryan <surenb@...gle.com>, Vlastimil Babka <vbabka@...e.cz>,
Christian Brauner <brauner@...nel.org>, Wei Yang
<richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>,
Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>,
linux-mm@...ck.org, linux-trace-kernel@...r.kernel.org,
Yu Zhao <yuzhao@...gle.com>, Roman Gushchin <roman.gushchin@...ux.dev>
Subject: Re: [RFC PATCH v2] Introduce Hierarchical Per-CPU Counters
On 2025-04-08 13:41, Liam R. Howlett wrote:
> * Matthew Wilcox <willy@...radead.org> [250408 13:03]:
>> On Tue, Apr 08, 2025 at 09:37:18AM -0700, Christoph Lameter (Ampere) wrote:
>>>> The hierarchical per-CPU counters propagate a sum approximation through
>>>> a binary tree. When reaching the batch size, the carry is propagated
>>>> through a binary tree which consists of log2(nr_cpu_ids) levels. The
>>>> batch size for each level is twice the batch size of the prior level.
>>>
>>> A binary tree? Could we do this N-way? Otherwise the tree will be 8 levels
>>> on a 512 cpu machine. Given the inflation of the number of cpus this
>>> scheme better work up to 8K cpus.
>>
>> I find that a fan-out somewhere between 8 and 16 works well in practice.
>> log16(512) gives a 3 level tree as does a log8 tree. log16(8192) is a 4
>> level tree whereas log8(8192) is a 5 level tree. Not a big difference
>> either way.
>>
>> Somebody was trying to persuade me that a new tree type that maintained
>> additional information at each level of the tree to make some operations
>> log(log(N)) would be a better idea than a B-tree that is log(N). I
>> countered that a wider tree made the argument unsound at any size tree
>> up to 100k. And we don't tend to have _that_ many objects in a
>> data structure inside the kernel.
>
> I still maintain vEB trees are super cool, but I am glad we didn't try
> to implement an RCU safe version.
>
>>
>> ceil(log14(100,000)) = 5
>> ceil(log2(log2(100,000))) = 5
>>
>> at a million, there's actually a gap, 6 vs 5. But constant factors
>> become a much larger factor than scalability arguments at that point.
>
> In retrospect, it seems more of a math win than a practical win - and
> only really the O(n) bounds. Beyond what willy points out, writes
> rippling up the tree should be a concern for most users since it will
> impact the restart of readers and negatively affect the writer speed -
> but probably not here (hot plug?).
This implementation of hierarchical per-cpu counters is lock-free
for increment/decrement *and* for precise/approximate sums.
The increment/decrement use:
- this_cpu_add_return on the fast-path,
- atomic_add_return_relaxed for intermediate levels carry propagation,
- atomic_add for approximate sum updates.
The precise sum iterates on all possible cpus, loading their current
value. The approximate sum simply loads the current value of the
approximate sum.
So I'm unsure about your concern of writers restarting readers, because
this tree does not rely on mutual exclusion between updaters and
readers, nor does it rely on cmpxchg-based retry mechanisms in readers.
I agree with you that updates going all the way up the tree may
negatively affect the updater and approximate sum reader performance due
to bouncing of the counter cache line across CPUs.
>
> Working in (multiples of) cacheline sized b-tree nodes makes the most
> sense, in my experience.
I'm confused. Can you explain how this recommendation can practically
apply to the hierarchical counters ?
Thanks!
Mathieu
>
> Thanks,
> Liam
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Powered by blists - more mailing lists