Message-ID: <20250627122434.LtaTjk_B@linutronix.de>
Date: Fri, 27 Jun 2025 14:24:34 +0200
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Chris Mason <clm@...a.com>, linux-kernel@...r.kernel.org,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: futex performance regression from "futex: Allow automatic
allocation of process wide futex hash"
On 2025-06-26 16:36:38 [+0200], Peter Zijlstra wrote:
> > I guess that is why you suggested earlier to use srcu. For
> > things like "hackbench -T" you end up creating a new hash on every
> > thread creation which is not applied because RCU takes a while.
> > This could be optimized later by checking if the hash in futex_phash_new
> > matches the requested size.
>
> So the main benefit of using SRCU (I have half a patch and a head-ache
> to show for it) is that you don't need the per-mm-per-cpu memory
> storage. The down-side is that you share the 'refcount' between all mm's
> so it gets to be even slower.
I see. So it is not just s/rcu/srcu/ for the hopefully quicker grace
period, since it does not affect the whole system. It's a different
algorithm.
> The 'problem' is that reasoning about this stuff gets even harder. But the
> main idea is to replace mm->futex_phash with a 'tombstone' value to
> force __futex_hash() into taking the slow-path and hitting
> mm->futex_hash_lock.
>
> Then you do call_srcu() to wait one SRCU period, this has everybody
> stalled, and guarantees the fph you had is now unused so we can rehash.
> Then replace the tombstone with the new hash and start over.
>
> It's just that stuff like futex_hash_get() gets somewhat tricky, you
> have to ensure the SRCU periods overlap.
>
> Anyway, that approach should be feasible as well, just not sure of the
> trade-offs.
The current approach looks reasonable enough and it survived yesterday's
testing.
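If I read the tombstone scheme right, it would look roughly like the
sketch below. This is only my paraphrase of your description, not code
from any patch: FUTEX_PHASH_TOMBSTONE, futex_srcu and futex_rehash()
are made-up names, and I use synchronize_srcu() where you mentioned
call_srcu(), just to keep the sequence linear:

```c
/* Hypothetical sentinel; __futex_hash() would check for it. */
#define FUTEX_PHASH_TOMBSTONE	((struct futex_private_hash *)1UL)

static void futex_rehash(struct mm_struct *mm,
			 struct futex_private_hash *new)
{
	/*
	 * Step 1: park the pointer on the tombstone so every
	 * __futex_hash() caller drops into the slow path and
	 * serializes on mm->futex_hash_lock.
	 */
	rcu_assign_pointer(mm->futex_phash, FUTEX_PHASH_TOMBSTONE);

	/*
	 * Step 2: wait one SRCU grace period. After this, nobody
	 * still holds a reference to the old hash obtained via the
	 * fast path, so the buckets can be rehashed into @new.
	 */
	synchronize_srcu(&futex_srcu);

	/* Step 3: publish the new hash; lookups go fast again. */
	rcu_assign_pointer(mm->futex_phash, new);
}
```

The tricky part you mention would then be futex_hash_get() proving that
its own srcu_read_lock() section overlaps the grace period in step 2.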
> > Why preempt_disable()? Is it just an optimized version of
> > rcu_read_lock()? I don't understand why. You don't even go for
> > __this_cpu_inc() so I am a bit puzzled.
>
> The code morphed a lot over the two days it took to write this. But
> yeah, preempt or rcu doesn't really matter here.
>
> I didn't yet think about __this_cpu, it's not really relevant on x86. If
> you want to make that relaxation you have to consider IRQ and SoftIRQ
> handling though. Given we have this_cpu_inc() in the RCU callback, which
> is run from SoftIRQ context, this might not be safe.
Nah, I am all for a simple guard(rcu) here.
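For the record, the race that makes __this_cpu_inc() unsafe here (on
architectures where it is a plain load/modify/store rather than a single
instruction) would be something like this; the counter name is made up:

```c
	/* process context, preemption disabled: */
	__this_cpu_inc(fph_refcount);	/* load ...			*/
					/* <SoftIRQ: RCU callback does	*/
					/*  this_cpu_inc(fph_refcount)>	*/
					/* ... store: softirq's		*/
					/* increment is lost		*/

	/* safe: this_cpu_inc() is IRQ/SoftIRQ safe by definition */
	this_cpu_inc(fph_refcount);
```

So any relaxation to __this_cpu_inc() would also have to exclude the
SoftIRQ path, which does not seem worth it.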
Sebastian