Message-ID: <55c89f03-6120-43d1-a620-46d8ca8aba4e@efficios.com>
Date: Thu, 3 Apr 2025 13:59:39 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>,
 Sweet Tea Dorminy <sweettea-kernel@...miny.me>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
 <mhiramat@...nel.org>, Dennis Zhou <dennis@...nel.org>,
 Tejun Heo <tj@...nel.org>, Christoph Lameter <cl@...ux.com>,
 Martin Liu <liumartin@...gle.com>, David Rientjes <rientjes@...gle.com>,
 Christian König <christian.koenig@....com>,
 Johannes Weiner <hannes@...xchg.org>, Sweet Tea Dorminy
 <sweettea@...gle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Suren Baghdasaryan <surenb@...gle.com>, Vlastimil Babka <vbabka@...e.cz>,
 Christian Brauner <brauner@...nel.org>, Wei Yang
 <richard.weiyang@...il.com>, David Hildenbrand <david@...hat.com>,
 Miaohe Lin <linmiaohe@...wei.com>, Al Viro <viro@...iv.linux.org.uk>,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, Yu Zhao <yuzhao@...gle.com>,
 Roman Gushchin <roman.gushchin@...ux.dev>, Mateusz Guzik <mjguzik@...il.com>
Subject: Re: [RFC PATCH v2] mm: use per-numa-node atomics instead of
 percpu_counters

On 2025-04-02 20:00, Shakeel Butt wrote:
> On Mon, Mar 31, 2025 at 06:35:14PM -0400, Sweet Tea Dorminy wrote:
[...]
> I am still not buying the 'good performance' point. To me we might need
> to go with reduced batch size of existing approach or multi level
> approach from Mathieu (I still have to see Mateusz and Kairui's
> proposals).

Here is an initial userspace prototype of my hierarchical split counters:

https://github.com/compudj/librseq/blob/percpu-counter/include/rseq/percpu-counter.h
https://github.com/compudj/librseq/blob/percpu-counter/src/percpu-counter.c

How to try it out:

* Install liburcu
* Clone & build https://github.com/compudj/librseq (branch: percpu-counter)

(Note: it's currently a POC and only very lightly tested.)

Running tests/percpu_counter_test.tap produces:

ok 1 - Registered current thread with rseq
Counter init: approx: 0 precise: 0 inaccuracy: ±2048
Counter after sum: approx: 2998016 precise: 2996800 inaccuracy: ±2048
Counter after set=0: approx: 1216 precise: 0 inaccuracy: ±2048
ok 2 - Unregistered current thread with rseq
1..2

It implements the following operations:

Fast paths:

- counter_add
- counter_approx_sum

Function call APIs for:

- counter_add_slowpath: propagate the approximation to levels > 0.
- counter_precise_sum: iterate over all per-CPU counters.
- counter_set: set a bias to bring the precise sum to a given target value.
- counter_inaccuracy: return the maximum inaccuracy of the approximation for
                      this counter configuration.
- counter_compare: compare a counter against a value; use the approximation if
                   the value is further away than the inaccuracy limit, else
                   use the precise sum.

Porting it to the Linux kernel and replacing lib/percpu_counter.c should be
straightforward. AFAIU, the only thing I have not implemented is a replacement
for percpu_counter_limited_add, and I'm not so sure how useful it is.
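
To show how a caller would combine these operations, here is a rough usage
sketch. Only counter_add's prototype appears in this message (below); the
header path, the initialization step and the prototypes of the other calls are
my guesses from the operation list above, not the actual librseq API:

/*
 * Rough usage sketch only. Operation names come from the list above;
 * the prototypes (other than counter_add) and the header path are
 * assumptions, not the actual librseq API.
 */
#include <stdio.h>
#include <rseq/percpu-counter.h>

static void example(struct percpu_counter *counter)
{
        /* counter is assumed to be already allocated/initialized elsewhere. */
        long approx, precise, inaccuracy;

        counter_add(counter, 4096);     /* fast path: per-CPU rseq update */
        counter_add(counter, -123);

        approx = counter_approx_sum(counter);   /* cheap, bounded error */
        precise = counter_precise_sum(counter); /* walks every per-CPU counter */
        inaccuracy = counter_inaccuracy(counter);

        printf("approx: %ld precise: %ld inaccuracy: +/-%ld\n",
               approx, precise, inaccuracy);

        /* Assumed to return <0, 0, or >0 like percpu_counter_compare(). */
        if (counter_compare(counter, 1 << 20) >= 0)
                printf("counter reached the limit\n");

        counter_set(counter, 0);        /* bias the counter so the precise sum reads 0 */
}

The point of the split is that counter_add and counter_approx_sum stay cheap,
while the precise sum, set and compare remain available as function calls for
callers that need exact values.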

The most relevant piece of the algorithm is in counter_add, shown below:

static inline
void counter_add(struct percpu_counter *counter, long inc)
{
        unsigned long bit_mask = counter->level0_bit_mask;
        intptr_t orig, res;
        int ret, cpu;

        if (!inc)
                return;
        /* This is basically a percpu_add_return() in userspace with rseq. */
        do {
                cpu = rseq_cpu_start();
                orig = *rseq_percpu_ptr(counter->level0, cpu);
                ret = rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID,
                                rseq_percpu_ptr(counter->level0, cpu),
                                orig, orig + inc, cpu);
        } while (ret);
        res = orig + inc;
        counter_dbg_printf("counter_add: inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
                           inc, bit_mask, orig, res);
        /*
         * The per-CPU counter already holds the full value; compute how much
         * needs to be propagated to the upper levels: the multiples of
         * 2 * bit_mask contained in the increment, plus a carry of bit_mask
         * whenever the bit_mask bit of the counter toggled without the value
         * changing sign.
         */
        if (inc < 0) {
                inc = -(-inc & ~((bit_mask << 1) - 1));
                /* xor bit_mask, same sign: underflow */
                if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
                        inc -= bit_mask;
        } else {
                inc &= ~((bit_mask << 1) - 1);
                /* xor bit_mask, same sign: overflow */
                if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
                        inc += bit_mask;
        }
        if (inc)
                counter_add_slowpath(counter, inc);
}
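
To make the carry detection concrete, here is a small standalone illustration
(not part of librseq; the bit_mask and counter values are made up, and
(orig ^ res) >= 0 stands in for __counter_same_sign()). A +100 increment that
pushes a per-CPU counter from 1000 across the 1024 boundary hands a carry of
bit_mask to counter_add_slowpath, while the per-CPU counter keeps the full
value:

#include <stdio.h>

int main(void)
{
        /* Illustrative values only: level-0 counter with bit_mask = 1024. */
        long bit_mask = 1024;
        long orig = 1000, inc = 100;
        long res = orig + inc;
        long propagate;

        /* Keep only the part of the increment that is a multiple of 2 * bit_mask. */
        propagate = inc & ~((bit_mask << 1) - 1);
        /* The bit_mask bit toggled without a sign change: add a carry of bit_mask. */
        if (((orig & bit_mask) ^ (res & bit_mask)) && (orig ^ res) >= 0)
                propagate += bit_mask;

        /* Prints: orig=1000 res=1100 -> propagate 1024 to the next level */
        printf("orig=%ld res=%ld -> propagate %ld to the next level\n",
               orig, res, propagate);
        return 0;
}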

void counter_add_slowpath(struct percpu_counter *counter, long inc)
{
        struct percpu_counter_level_item *item = counter->items;
        unsigned int level_items = counter->nr_cpus >> 1;
        unsigned int level, nr_levels = counter->nr_levels;
        long bit_mask = counter->level0_bit_mask;
        int cpu = rseq_current_cpu_raw();

        /* Each intermediate level has half as many counters as the level below it. */
        for (level = 1; level < nr_levels; level++) {
                long orig, res;
                long *count = &item[cpu & (level_items - 1)].count;

                bit_mask <<= 1;
                res = uatomic_add_return(count, inc, CMM_RELAXED);
                orig = res - inc;
                counter_dbg_printf("counter_add_slowpath: level %d, inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
                                   level, inc, bit_mask, orig, res);
                if (inc < 0) {
                        inc = -(-inc & ~((bit_mask << 1) - 1));
                        /* xor bit_mask, same sign: underflow */
                        if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
                                inc -= bit_mask;
                } else {
                        inc &= ~((bit_mask << 1) - 1);
                        /* xor bit_mask, same sign: overflow */
                        if (((orig & bit_mask) ^ (res & bit_mask)) && __counter_same_sign(orig, res))
                                inc += bit_mask;
                }
                item += level_items;
                level_items >>= 1;
                if (!inc)
                        return;
        }
        /* Whatever carry is left goes into the last (top) level counter. */
        counter_dbg_printf("counter_add_slowpath: last level add %ld\n", inc);
        uatomic_add(&item->count, inc, CMM_RELAXED);
}
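
As a rough illustration of the level layout implied by the indexing above,
here is a standalone sketch (the nr_cpus and nr_levels values are mine, not
taken from the prototype) that replays the same index arithmetic: level 1 uses
nr_cpus/2 slots, each further level halves the slot count, and the final
uatomic_add() lands on the single slot after the last intermediate level:

#include <stdio.h>

int main(void)
{
        /* Illustrative values only; the real configuration is chosen by librseq. */
        unsigned int nr_cpus = 8, nr_levels = 4;
        unsigned int cpu, level, level_items, base;

        for (cpu = 0; cpu < nr_cpus; cpu++) {
                printf("cpu %u:", cpu);
                base = 0;
                level_items = nr_cpus >> 1;
                for (level = 1; level < nr_levels; level++) {
                        printf(" L%u->item[%u]",
                               level, base + (cpu & (level_items - 1)));
                        base += level_items;
                        level_items >>= 1;
                }
                printf(" top->item[%u]\n", base);
        }
        return 0;
}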

Feedback is welcome!

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
