Message-ID: <969781b2-1bf4-4f8e-b694-452e593bb39a@paulmck-laptop>
Date: Thu, 6 Nov 2025 12:17:50 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Shrikanth Hegde <sshegde@...ux.ibm.com>,
	André Almeida <andrealmeid@...lia.com>,
	Darren Hart <dvhart@...radead.org>,
	Davidlohr Bueso <dave@...olabs.net>, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org,
	Valentin Schneider <vschneid@...hat.com>,
	Waiman Long <longman@...hat.com>
Subject: Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting

On Thu, Nov 06, 2025 at 12:23:39PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 06, 2025 at 12:09:07PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-11-06 10:29:29 [+0100], Peter Zijlstra wrote:
> > > Subject: futex: Optimize per-cpu reference counting
> > > From: Peter Zijlstra <peterz@...radead.org>
> > > Date: Wed, 16 Jul 2025 16:29:46 +0200
> > > 
> > > Shrikanth noted that the per-cpu reference counter was still some 10%
> > > slower than the old immutable option (which removes the reference
> > > counting entirely).
> > > 
> > > Further optimize the per-cpu reference counter by:
> > > 
> > >  - switching from RCU to preempt;
> > >  - using __this_cpu_*() since we now have preempt disabled;
> > >  - switching from smp_load_acquire() to READ_ONCE().
> > > 
> > > This is all safe because disabling preemption inhibits the RCU grace
> > > period exactly like rcu_read_lock().
> > > 
> > > Having preemption disabled allows using __this_cpu_*() provided the
> > > only access to the variable is in task context -- which is the case
> > > here.
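
(Purely for illustration, a minimal sketch of what the optimized
get-reference fast-path described above might look like; fph->state and
FR_ATOMIC are taken from the discussion, whereas the counter fields and
the function name here are hypothetical, not the actual kernel code:)

	/* Hypothetical sketch; only fph->state and FR_ATOMIC are real names. */
	static bool futex_private_hash_get_sketch(struct futex_private_hash *fph)
	{
		bool ret = true;

		/*
		 * Disabling preemption blocks an RCU grace period just like
		 * rcu_read_lock(), and makes __this_cpu_*() sufficient since
		 * the per-CPU counter is only touched from task context.
		 */
		preempt_disable();
		if (READ_ONCE(fph->state) == FR_ATOMIC) {
			/* Slow path: shared atomic counter (hypothetical field). */
			ret = atomic_inc_not_zero(&fph->atomic_users);
		} else {
			/* Fast path: plain per-CPU increment, no IRQ-state
			 * swizzling, no acquire barrier (hypothetical field). */
			__this_cpu_inc(*fph->percpu_users);
		}
		preempt_enable();
		return ret;
	}
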
> > 
> > Right. Reads and writes from softirq happen only after the user has
> > transitioned to atomics.
> > 
> > > Furthermore, since we know changing fph->state to FR_ATOMIC demands a
> > > full RCU grace period, we can rely on the implied smp_mb() from that to
> > > replace the acquire barrier().
> > 
> > That is the only part I struggle with, but having an smp_mb() after a
> > grace period sounds reasonable.
> 
> IIRC the argument goes something like so:
> 
> A grace-period (for rcu-sched, which is implied by regular rcu)
> implies that every task has done at least one voluntary context switch.

Agreed, except for: s/voluntary context switch/context switch/

It is Tasks RCU that pays attention only to voluntary context switches.

> A context switch implies a full barrier.
> 
> Therefore observing a state change separated by a grace-period implies
> an smp_mb().

Just to be pedantic, for any given CPU and any given grace period,
it is the case that:

1.	That CPU will have executed a full barrier between any code
	executed on any CPU that happens before the beginning of that
	grace period and any RCU read-side critical section on that CPU
	that extends beyond the end of that grace period, and

2.	That CPU will have executed a full barrier between any RCU
	read-side critical section on that CPU that extends before the
	beginning of that grace period and any code executed on any CPU
	that happens after the end of that grace period.

An RCU read-side critical section is: (1) any region of code protected by
rcu_read_lock() and friends, and (2) any region of code where preemption
is disabled that does not contain a call to schedule().
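
As a concrete (and purely illustrative) application of point 2 to this
series, the update side can look roughly like the sketch below; fph->state
and FR_ATOMIC are from the patch discussion, everything else (the function
name, the folding step) is hypothetical:

	static void futex_private_hash_make_atomic_sketch(struct futex_private_hash *fph)
	{
		/* Switch readers over to the shared atomic counter. */
		WRITE_ONCE(fph->state, FR_ATOMIC);
		synchronize_rcu();
		/*
		 * Per point 2 above, every CPU has executed a full barrier
		 * between any preempt-disabled (read-side) region that began
		 * before this grace period and the code following
		 * synchronize_rcu().  So all per-CPU increments done by
		 * readers that did not observe FR_ATOMIC are visible here,
		 * and no new ones can appear, without the readers needing
		 * smp_load_acquire() -- READ_ONCE() on fph->state suffices.
		 */
		/* ... fold the per-CPU counts into the shared atomic counter ... */
	}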

							Thanx, Paul

> > > This is very similar to the percpu_down_read_internal() fast-path.
> > >
> > > The reason this is significant for PowerPC is that it uses the generic
> > > this_cpu_*() implementation, which relies on local_irq_disable() (the
> > > x86 implementation relies on it being a single memop instruction to be
> > > IRQ-safe). Switching to preempt_disable() and __this_cpu_*() avoids
> > > this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE
> > > barrier, so not having to use explicit barriers saves a bunch.
> > > 
> > > Combined this reduces the performance gap by half, down to some 5%.
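
(Again purely as illustration of the IRQ-state swizzling being avoided:
the generic this_cpu_*() ops must be IRQ-safe, so they bracket the update
with an IRQ save/restore, roughly as sketched below; the actual macro
expansions in the kernel differ in detail:)

	/* Roughly what the generic, IRQ-safe this_cpu_inc(pcp) has to do: */
	unsigned long flags;

	raw_local_irq_save(flags);	/* save and disable IRQs around the update */
	raw_cpu_add(pcp, 1);
	raw_local_irq_restore(flags);

	/* Under preempt_disable(), with the counter only accessed from task
	 * context, __this_cpu_inc(pcp) can skip the IRQ save/restore: */
	raw_cpu_add(pcp, 1);
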
> > 
> > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
> > 
> > Sebastian
