linux-kernel - Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting


Message-ID: <ae8c6fd5-cc9c-44f3-a489-0346873f4be5@linux.ibm.com>
Date: Wed, 16 Jul 2025 23:51:46 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        André Almeida <andrealmeid@...lia.com>,
        Darren Hart <dvhart@...radead.org>,
        Davidlohr Bueso <dave@...olabs.net>, Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org,
        Valentin Schneider <vschneid@...hat.com>,
        Waiman Long <longman@...hat.com>
Subject: Re: [PATCH v2 0/6] futex: Use RCU-based per-CPU reference counting



On 7/16/25 19:59, Peter Zijlstra wrote:
> On Tue, Jul 15, 2025 at 10:34:24PM +0530, Shrikanth Hegde wrote:
> 
>> I did try again by going to baseline, removed BROKEN and ran below. Which gives us immutable numbers.
>> ./perf bench futex hash -Ib512
>> Averaged 1536035 operations/sec (+- 0.11%), total secs = 10
>> Futex hashing: 512 hash buckets (immutable)
>>
>> So, with -b 512 option, it is around 8-10% less compared to immutable.
> 
> Urgh, can you run perf on that and tell me if this is due to
> this_cpu_{inc,dec}() doing local_irq_disable() or the smp_load_acquire()
> doing LWSYNC ?

It looks like the cost comes from RCU and from IRQ enabling.
Both perf records were collected with -b512.


base_futex_immutable_b512 - perf record collected with baseline + remove BROKEN + ./perf bench futex hash -Ib512
per_cpu_futex_hash_b_512 - baseline + series + ./perf bench futex hash -b512


perf diff base_futex_immutable_b512 per_cpu_futex_hash_b_512
# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object               Symbol
# ........  .........  ..........................  ....................................................
#
     21.62%     -2.26%  [kernel.vmlinux]            [k] futex_get_value_locked
      0.16%     +2.01%  [kernel.vmlinux]            [k] __rcu_read_unlock
      1.35%     +1.63%  [kernel.vmlinux]            [k] arch_local_irq_restore.part.0
                +1.48%  [kernel.vmlinux]            [k] futex_private_hash_put
                +1.16%  [kernel.vmlinux]            [k] futex_ref_get
     10.41%     -0.78%  [kernel.vmlinux]            [k] system_call_vectored_common
      1.24%     +0.72%  perf                        [.] workerfn
      5.32%     -0.66%  [kernel.vmlinux]            [k] futex_q_lock
      2.48%     -0.43%  [kernel.vmlinux]            [k] futex_wait
      2.47%     -0.40%  [kernel.vmlinux]            [k] _raw_spin_lock
      2.98%     -0.35%  [kernel.vmlinux]            [k] futex_q_unlock
      2.42%     -0.34%  [kernel.vmlinux]            [k] __futex_wait
      5.47%     -0.32%  libc.so.6                   [.] syscall
      4.03%     -0.32%  [kernel.vmlinux]            [k] memcpy_power7
      0.16%     +0.22%  [kernel.vmlinux]            [k] arch_local_irq_restore
      5.93%     -0.18%  [kernel.vmlinux]            [k] futex_hash
      1.72%     -0.17%  [kernel.vmlinux]            [k] sys_futex


> 
> Anyway, I think we can improve both. Does the below help?
> 
> 
> ---
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index d9bb5567af0c..8c41d050bd1f 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -1680,10 +1680,10 @@ static bool futex_ref_get(struct futex_private_hash *fph)
>   {
>   	struct mm_struct *mm = fph->mm;
>   
> -	guard(rcu)();
> +	guard(preempt)();
>   
> -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> -		this_cpu_inc(*mm->futex_ref);
> +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> +		__this_cpu_inc(*mm->futex_ref);
>   		return true;
>   	}
>   
> @@ -1694,10 +1694,10 @@ static bool futex_ref_put(struct futex_private_hash *fph)
>   {
>   	struct mm_struct *mm = fph->mm;
>   
> -	guard(rcu)();
> +	guard(preempt)();
>   
> -	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> -		this_cpu_dec(*mm->futex_ref);
> +	if (READ_ONCE(fph->state) == FR_PERCPU) {
> +		__this_cpu_dec(*mm->futex_ref);
>   		return false;
>   	}
>   

Yes, it helps. It improves the "-b 512" numbers by about 5%.

baseline + series:
Averaged 1412543 operations/sec (+- 0.14%), total secs = 10
Futex hashing: 512 hash buckets


baseline + series + above_patch:
Averaged 1482733 operations/sec (+- 0.26%), total secs = 10   <<< 5% improvement
Futex hashing: 512 hash buckets


We are now within 4-5% of baseline/immutable.
baseline:
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (HEAD)

./perf bench futex hash
Averaged 1559643 operations/sec (+- 0.09%), total secs = 10
Futex hashing: global hash