Message-ID: <b2fb5a99-8dc2-440b-bf52-1dbcf3d7d9a7@efficios.com>
Date: Mon, 3 Nov 2025 08:34:10 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: "Paul E. McKenney" <paulmck@...nel.org>, rcu@...r.kernel.org
Cc: linux-kernel@...r.kernel.org, kernel-team@...a.com, rostedt@...dmis.org,
Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
linux-arm-kernel@...ts.infradead.org, bpf@...r.kernel.org
Subject: Re: [PATCH 17/19] srcu: Optimize SRCU-fast-updown for arm64
On 2025-11-02 16:44, Paul E. McKenney wrote:
> Some arm64 platforms have slow per-CPU atomic operations, for example,
> the Neoverse V2. This commit therefore moves SRCU-fast from per-CPU
> atomic operations to interrupt-disabled non-read-modify-write-atomic
> atomic_read()/atomic_set() operations. This works because
> SRCU-fast-updown, unlike srcu_read_unlock_fast(), is not invoked from
> NMI handlers. This means that srcu_read_lock_fast_updown() and
> srcu_read_unlock_fast_updown() can exclude themselves and each other
> simply by disabling interrupts.
>
> This reduces the overhead of calls to srcu_read_lock_fast_updown() and
> srcu_read_unlock_fast_updown() from about 100ns to about 12ns on an ARM
> Neoverse V2. Although this is not excellent compared to about 2ns on x86,
> it sure beats 100ns.
>
> This command was used to measure the overhead:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --torture refscale --allcpus --duration 5 --configs NOPREEMPT --kconfig "CONFIG_NR_CPUS=64 CONFIG_TASKS_TRACE_RCU=y" --bootargs "refscale.loops=100000 refscale.guest_os_delay=5 refscale.nreaders=64 refscale.holdoff=30 torture.disable_onoff_at_boot refscale.scale_type=srcu-fast-updown refscale.verbose_batched=8 torture.verbose_sleep_frequency=8 torture.verbose_sleep_duration=8 refscale.nruns=100" --trust-make
>
Hi Paul,
At a high level, what are you trying to achieve with this?
AFAIU, you are trying to remove the cost of atomics on per-CPU
data from frequent srcu-fast read lock/unlock calls when
CONFIG_NEED_SRCU_NMI_SAFE=y, am I on the right track?
[disclaimer: I've looked only briefly at your proposed patch.]
That said, there are various other, less specific approaches to consider
before introducing such an architecture- and use-case-specific workaround.
One example is the libside (user-level) RCU implementation, which uses
two counters per CPU [1]. One counter is for the rseq fast path, and the
second counter is for atomics (as a fallback).
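To make this concrete, here is a rough user-level sketch of the idea
(this is *not* libside's actual code, see [1] for that; rseq_addv() and
rseq_current_cpu_raw() are librseq helpers, and their exact signatures
vary across librseq versions):

#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <rseq/rseq.h>

struct rcu_cpu_counters {
        intptr_t rseq_count;            /* rseq fast path */
        _Atomic intptr_t atomic_count;  /* atomic fallback */
};

static void rcu_read_inc(struct rcu_cpu_counters *cpus)
{
        int cpu = rseq_current_cpu_raw();

        /* Fast path: restartable sequence, no atomic instruction. */
        if (cpu >= 0 && !rseq_addv(&cpus[cpu].rseq_count, 1, cpu))
                return;

        /* Fallback: rseq unregistered or aborted, use an atomic. */
        atomic_fetch_add_explicit(&cpus[sched_getcpu()].atomic_count,
                                  1, memory_order_relaxed);
}

The grace period then sums rseq_count + atomic_count for each CPU.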
If the typical scenario we want to optimize for is thread context, we
can probably remove the atomic from the fast path with just preemption
disabled, by further partitioning the per-CPU counters. One possibility:
struct percpu_srcu_fast_pair {
        unsigned long lock, unlock;
};

struct percpu_srcu_fast {
        struct percpu_srcu_fast_pair thread;    /* task context */
        struct percpu_srcu_fast_pair irq;       /* interrupt context */
};
And the grace period sums both thread and irq counters.
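Roughly, the fast path and the grace-period summation could then look
like this (a minimal sketch with hypothetical function names):

static inline void percpu_srcu_fast_lock(struct percpu_srcu_fast __percpu *pcp)
{
        struct percpu_srcu_fast_pair *p;

        preempt_disable();
        /* Task and interrupt contexts use disjoint counters. */
        if (in_task())
                p = &this_cpu_ptr(pcp)->thread;
        else
                p = &this_cpu_ptr(pcp)->irq;
        WRITE_ONCE(p->lock, p->lock + 1);
        preempt_enable();
}

static unsigned long percpu_srcu_fast_sum_locks(struct percpu_srcu_fast __percpu *pcp)
{
        unsigned long sum = 0;
        int cpu;

        for_each_possible_cpu(cpu) {
                struct percpu_srcu_fast *p = per_cpu_ptr(pcp, cpu);

                sum += READ_ONCE(p->thread.lock) + READ_ONCE(p->irq.lock);
        }
        return sum;
}

The preempt_disable() is redundant when called from interrupt context,
but harmless in a sketch.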
Thoughts?
Thanks,
Mathieu
[1] https://github.com/compudj/libside/blob/master/src/rcu.h#L71
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com