Message-ID: <20250626134820.ybEtTXSN@linutronix.de>
Date: Thu, 26 Jun 2025 15:48:20 +0200
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Chris Mason <clm@...a.com>, linux-kernel@...r.kernel.org,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: futex performance regression from "futex: Allow automatic
 allocation of process wide futex hash"

On 2025-06-24 21:01:18 [+0200], Peter Zijlstra wrote:
> How about something like this (very lightly tested)...
> 
> the TL;DR is that it turns all those refcounts into per-cpu ops when
> there is no hash replacement pending (eg. the normal case), and only
> folds the lot into an atomic when we really care about it.

So we have per-CPU counters, and on resize we wait one RCU grace period
to ensure everybody observed the current fph, then switch to atomics and
wait another grace period to ensure everyone is using the atomic
counter. The last step is to fold the per-CPU counters into the atomic
counter, and once that counter reaches 0, perform the swap.
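
Just to check I read the callback chain right, this is how I picture it
(my own sketch pieced together from the fragments below, not your actual
code; the handling of the initial-reference bias is left out):

static void futex_ref_rcu(struct rcu_head *head)
{
	struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu);
	struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash);
	long sum = 0;
	int cpu;

	if (fph->state == FR_PERCPU) {
		/*
		 * GP #1 is over: every CPU observes the current fph.
		 * Flip to atomic mode and wait one more GP so nobody
		 * still uses the per-CPU ops.
		 */
		smp_store_release(&fph->state, FR_ATOMIC);
		call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
		return;
	}

	/*
	 * GP #2 is over: fold the per-CPU counts into the atomic
	 * counter. Once that counter drops to 0 the swap can happen.
	 */
	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(mm->futex_ref, cpu);

	atomic_long_add(sum, &mm->futex_atomic);
}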

This looks fine :) Due to the RCU grace periods, the swap takes longer
than before. I guess that is why you suggested earlier to use SRCU. For
things like "hackbench -T" you end up creating a new hash on every
thread creation, which does not get applied in time because the RCU
grace period takes a while. This could be optimized later by checking
whether the hash in futex_phash_new already matches the requested size.
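
Something like this in the allocation path, maybe (untested sketch;
hash_slots is just what I call the requested size here, and the caller
would hold mm->futex_hash_lock):

	/*
	 * Untested sketch: if a replacement hash of the requested size
	 * is already pending, keep it instead of allocating a new one.
	 */
	struct futex_private_hash *pending = mm->futex_phash_new;

	if (pending && pending->hash_mask + 1 == hash_slots)
		return 0;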

> There's some sharp corners still.. but it boots and survives the
> (slightly modified) selftest.

The refcount does not pop up in perf anymore, so that is good.

> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
>  	fph = rcu_dereference_protected(mm->futex_phash,
>  					lockdep_is_held(&mm->futex_hash_lock));
>  	if (fph) {
> -		if (!rcuref_is_dead(&fph->users)) {
> +		if (!futex_ref_is_dead(fph)) {
>  			mm->futex_phash_new = new;
>  			return false;
>  		}
>  
>  		futex_rehash_private(fph, new);
>  	}
> -	rcu_assign_pointer(mm->futex_phash, new);
> +	new->state = FR_PERCPU;
> +	scoped_guard (rcu) {

Do we put a space after scoped_guard or not? It looks like sched/ does
while the rest of the kernel mostly doesn't. I don't mind either way,
but we could (later) settle on one style for futex.

> +		mm->futex_batches = get_state_synchronize_rcu();
> +		rcu_assign_pointer(mm->futex_phash, new);
> +	}
>  	kvfree_rcu(fph, rcu);
>  	return true;
>  }
…
> +static void futex_ref_drop(struct futex_private_hash *fph)
…
> +	call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);

Do you think it would improve with SRCU, or is it not worth it?

> +}
> +
> +static bool futex_ref_get(struct futex_private_hash *fph)
> +{
> +	struct mm_struct *mm = fph->mm;
> +
> +	guard(preempt)();
> +
> +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> +		this_cpu_inc(*mm->futex_ref);
> +		return true;
> +	}
> +
> +	return atomic_long_inc_not_zero(&mm->futex_atomic);
> +}
> +
> +static bool futex_ref_put(struct futex_private_hash *fph)
> +{
> +	struct mm_struct *mm = fph->mm;
> +
> +	guard(preempt)();
> +
> +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> +		this_cpu_dec(*mm->futex_ref);
> +		return false;
> +	}
> +
> +	return atomic_long_dec_and_test(&mm->futex_atomic);
> +}
> +
> +static bool futex_ref_is_dead(struct futex_private_hash *fph)
> +{
> +	struct mm_struct *mm = fph->mm;
> +
> +	guard(preempt)();
> +
> +	if (smp_load_acquire(&fph->state) == FR_PERCPU)
> +		return false;
> +
> +	return atomic_long_read(&mm->futex_atomic) == 0;
> +}

Why preempt_disable()? Is it just an optimized version of
rcu_read_lock()? I don't understand why. You don't even go for
__this_cpu_inc(), so I'm a bit puzzled.

Sebastian
