Message-ID: <55c89f03-6120-43d1-a620-46d8ca8aba4e@efficios.com>
Date: Thu, 3 Apr 2025 13:59:39 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>,
Sweet Tea Dorminy <sweettea-kernel@...miny.me>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
<mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>,
Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>,
Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
Christian König <christian.koenig@....com>,
Johannes Weiner <hannes@...xchg.org>, Sweet Tea Dorminy
<sweettea@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Suren Baghdasaryan <surenb@...gle.com>, Vlastimil Babka <vbabka@...e.cz>,
Christian Brauner <brauner@...nel.org>, Wei Yang
<richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>,
Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, Yu Zhao <yuzhao@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>, Mateusz Guzik <mjguzik@...il.com>
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of
percpu_counters
On 2025-04-02 20:00, Shakeel Butt wrote:
> On Mon, Mar 31, 2025 at 06:35:14PM -0400, Sweet Tea Dorminy wrote:
[...]
> I am still not buying the 'good performance' point. To me we might need
> to go with reduced batch size of existing approach or multi level
> approach from Mathieu (I still have to see Mateusz and Kairui's
> proposals).
Here is an initial userspace prototype of my hierarchical split counters:
https://github.com/compudj/librseq/blob/percpu-counter/include/rseq/percpu-counter.h
https://github.com/compudj/librseq/blob/percpu-counter/src/percpu-counter.c
How to try it out:
* Install liburcu
* Clone & build https://github.com/compudj/librseq branch: percpu-counter
  (note: it's currently a POC, very lightly tested.)
Run tests/percpu_counter_test.tap:
ok 1 - Registered current thread with rseq
Counter init: approx: 0 precise: 0 inaccuracy: ±2048
Counter after sum: approx: 2998016 precise: 2996800 inaccuracy: ±2048
Counter after set=0: approx: 1216 precise: 0 inaccuracy: ±2048
ok 2 - Unregistered current thread with rseq
1..2
It implements the following operations:

Fast paths:
- counter_add
- counter_approx_sum

Function call APIs:
- counter_add_slowpath: propagate the approximation to levels > 0,
- counter_precise_sum: iterate over all per-CPU counters,
- counter_set: set a bias to bring the precise sum to a given target value,
- counter_inaccuracy: return the maximum inaccuracy of the approximation for
  this counter configuration,
- counter_compare: compare a counter against a value; use the approximation if
  the value is further away than the inaccuracy limit, else use the precise sum.
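
For illustration, here is a purely hypothetical usage sketch of that API. Only
counter_add()'s prototype appears in the code further down this mail; the other
prototypes and the memcmp()-style convention assumed for counter_compare() are
guesses, not the branch's actual signatures:

/* Hypothetical usage sketch; the prototypes below are guesses, see
 * include/rseq/percpu-counter.h in the branch for the real ones. */
#include <stdio.h>

struct percpu_counter;

void counter_add(struct percpu_counter *counter, long inc);
long counter_approx_sum(struct percpu_counter *counter);
long counter_precise_sum(struct percpu_counter *counter);
long counter_inaccuracy(struct percpu_counter *counter);
int counter_compare(struct percpu_counter *counter, long v);
void counter_set(struct percpu_counter *counter, long v);

static void example(struct percpu_counter *counter)
{
        counter_add(counter, 1);                /* fast path: per-CPU add via rseq */

        long approx = counter_approx_sum(counter);      /* cheap, bounded error */
        long precise = counter_precise_sum(counter);    /* walks all per-CPU slots */
        long slack = counter_inaccuracy(counter);       /* fixed for this configuration */

        printf("approx=%ld precise=%ld inaccuracy=%ld\n", approx, precise, slack);

        /* Assumed memcmp()-style result; per the description above it uses the
         * approximation when the target is further away than the inaccuracy
         * bound, and falls back to a precise sum otherwise. */
        if (counter_compare(counter, 1000000) > 0)
                printf("counter above threshold\n");

        counter_set(counter, 0);        /* bias so the precise sum reads 0 */
}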
Porting it to the Linux kernel and replacing lib/percpu_counter.c should be
straightforward. AFAIU, the only thing I have not implemented is a replacement
for percpu_counter_limited_add, and I'm not so sure how useful it is.
The most relevant piece of the algorithm is within counter_add as follows:
static inline
void counter_add(struct percpu_counter *counter, long inc)
{
        unsigned long bit_mask = counter->level0_bit_mask;
        intptr_t orig, res;
        int ret, cpu;

        if (!inc)
                return;

        // This is basically a percpu_add_return() in userspace with rseq.
        do {
                cpu = rseq_cpu_start();
                orig = *rseq_percpu_ptr(counter->level0, cpu);
                ret = rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID,
                                                rseq_percpu_ptr(counter->level0, cpu),
                                                orig, orig + inc, cpu);
        } while (ret);
        res = orig + inc;

        counter_dbg_printf("counter_add: inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
                           inc, bit_mask, orig, res);

        if (inc < 0) {
                inc = -(-inc & ~((bit_mask << 1) - 1));
                /* xor bit_mask, same sign: underflow */
                if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
                        inc -= bit_mask;
        } else {
                inc &= ~((bit_mask << 1) - 1);
                /* xor bit_mask, same sign: overflow */
                if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
                        inc += bit_mask;
        }
        if (inc)
                counter_add_slowpath(counter, inc);
}
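
To make the carry computation concrete, here is a standalone trace of that bit
manipulation with level0_bit_mask = 4. The helper and the same-sign test are my
own reconstruction, not code from the branch; the expected values are simply the
code above followed by hand:

/* Standalone trace of the carry logic in counter_add() (assumed same-sign test). */
#include <assert.h>
#include <stdbool.h>

static bool same_sign(long a, long b)
{
        return (a ^ b) >= 0;
}

/* Returns what counter_add() would hand off to counter_add_slowpath(). */
static long carried(long orig, long inc, long bit_mask)
{
        long res = orig + inc;

        if (inc < 0) {
                inc = -(-inc & ~((bit_mask << 1) - 1));
                if (((orig & bit_mask) ^ (res & bit_mask)) && same_sign(orig, res))
                        inc -= bit_mask;
        } else {
                inc &= ~((bit_mask << 1) - 1);
                if (((orig & bit_mask) ^ (res & bit_mask)) && same_sign(orig, res))
                        inc += bit_mask;
        }
        return inc;
}

int main(void)
{
        /* 6 + 5 = 11: bit 2 flips (1 -> 0), so 4 is carried to level 1. */
        assert(carried(6, 5, 4) == 4);
        /* 3 + 6 = 9: bit 2 stays 0, nothing is carried yet; the delta
         * remains visible only in the level-0 per-CPU slot. */
        assert(carried(3, 6, 4) == 0);
        /* 0 + 21 = 21: the multiple-of-8 part (16) plus the bit-2 flip (4). */
        assert(carried(0, 21, 4) == 20);
        return 0;
}

The same quantization then repeats at each level of the tree in the slow path: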
void counter_add_slowpath(struct percpu_counter *counter, long inc)
{
        struct percpu_counter_level_item *item = counter->items;
        unsigned int level_items = counter->nr_cpus >> 1;
        unsigned int level, nr_levels = counter->nr_levels;
        long bit_mask = counter->level0_bit_mask;
        int cpu = rseq_current_cpu_raw();

        for (level = 1; level < nr_levels; level++) {
                long orig, res;
                long *count = &item[cpu & (level_items - 1)].count;

                bit_mask <<= 1;
                res = uatomic_add_return(count, inc, CMM_RELAXED);
                orig = res - inc;
                counter_dbg_printf("counter_add_slowpath: level %d, inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
                                   level, inc, bit_mask, orig, res);
                if (inc < 0) {
                        inc = -(-inc & ~((bit_mask << 1) - 1));
                        /* xor bit_mask, same sign: underflow */
                        if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
                                inc -= bit_mask;
                } else {
                        inc &= ~((bit_mask << 1) - 1);
                        /* xor bit_mask, same sign: overflow */
                        if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
                                inc += bit_mask;
                }
                item += level_items;
                level_items >>= 1;
                if (!inc)
                        return;
        }
        counter_dbg_printf("counter_add_slowpath: last level add %ld\n", inc);
        uatomic_add(&item->count, inc, CMM_RELAXED);
}
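
For the kernel port mentioned above, the rseq loop in counter_add() would
presumably collapse into this_cpu_add_return(), with the upper-level slots
becoming plain atomics. A rough, untested sketch follows; the structure name,
field layout and helper names are assumptions standing in for a reworked
struct percpu_counter, not an actual patch:

#include <linux/percpu.h>
#include <linux/types.h>

/* Stand-in for a reworked struct percpu_counter; layout is an assumption. */
struct percpu_counter_hier {
        unsigned long level0_bit_mask;
        long __percpu *level0;          /* per-CPU level-0 slots */
        /* ... upper-level items, nr_levels, nr_cpus ... */
};

void counter_add_slowpath_hier(struct percpu_counter_hier *counter, long inc);

static inline bool __counter_same_sign_hier(long a, long b)
{
        return (a ^ b) >= 0;
}

static inline void counter_add_hier(struct percpu_counter_hier *counter, long inc)
{
        unsigned long bit_mask = counter->level0_bit_mask;
        long orig, res;

        if (!inc)
                return;

        /* Preempt/irq-safe per-CPU add; replaces the userspace rseq loop. */
        res = this_cpu_add_return(*counter->level0, inc);
        orig = res - inc;

        /* Same carry computation as the userspace fast path above. */
        if (inc < 0) {
                inc = -(-inc & ~((bit_mask << 1) - 1));
                if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign_hier(orig, res))
                        inc -= bit_mask;
        } else {
                inc &= ~((bit_mask << 1) - 1);
                if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign_hier(orig, res))
                        inc += bit_mask;
        }
        if (inc)
                counter_add_slowpath_hier(counter, inc);        /* propagate to levels > 0 */
}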
Feedback is welcome!
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com