Message-ID: <7c660d7a-6c70-4307-895f-70d4aa274886@linux.ibm.com>
Date: Wed, 26 Mar 2025 18:24:37 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: André Almeida <andrealmeid@...lia.com>,
Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Valentin Schneider <vschneid@...hat.com>,
Waiman Long <longman@...hat.com>, linux-kernel@...r.kernel.org,
"Nysal Jan K.A." <nysal@...ux.ibm.com>
Subject: Re: [PATCH v10 00/21] futex: Add support task local hash maps,
FUTEX2_NUMA and FUTEX2_MPOL
On 3/26/25 15:01, Sebastian Andrzej Siewior wrote:
> On 2025-03-26 00:34:23 [+0530], Shrikanth Hegde wrote:
>> Hi Sebastian.
> Hi Shrikanth,
>
Hi.
>> So, I did some more benchmarking using the same perf futex hash benchmark.
>> I see that perf creates N threads, binds each thread to a CPU, and then
>> calls futex_wait such that it never blocks; it always returns EWOULDBLOCK.
>> Only futex_hash is exercised.
>
> It also does spin_lock() + unlock on the hash bucket. Without the
> locking, you would have constant numbers.
>
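(Side note for anyone reading along: this is roughly the pattern the
benchmark exercises. A minimal sketch, not the actual perf bench code:
FUTEX_WAIT with an expected value that never matches the futex word, so
the kernel does the hash lookup, takes and releases the bucket lock, and
returns EWOULDBLOCK without ever blocking.)

#include <errno.h>
#include <linux/futex.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	unsigned int futex_word = 0;

	/*
	 * Expected value 1 while the word is 0: the wait can never
	 * succeed, so the syscall only exercises the hash lookup and
	 * bucket lock/unlock path and fails with EWOULDBLOCK.
	 */
	if (syscall(SYS_futex, &futex_word, FUTEX_WAIT_PRIVATE, 1,
		    NULL, NULL, 0) == -1 && errno == EWOULDBLOCK)
		printf("futex_wait returned EWOULDBLOCK, as expected\n");
	return 0;
}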
Thanks for the explanation.
Also, given how perf binds the threads, it brings all the SMT threads up, and the
1-thread case probably gets the benefit of SMT folding. So beyond 40 threads the
baseline numbers barely change.
>> Numbers with different thread counts (private futexes):
>>
>> threads    baseline    with series    ratio
>>       1    3386265     3266560        0.96
>>      10    1972069      821565        0.41
>>      40    1580497      277900        0.17
>>      80    1555482      150450        0.096
>>
>>
>> With shared futexes (-s option):
>>
>> threads    baseline    with series    ratio
>>      80     590144      585067        0.99
>
> The shared numbers are equal since the code path there is unchanged.
>
>> After looking into the code, and after some hacking, I could get the
>> performance back with the change below. It is likely not functionally
>> correct. The reasons for the change:
>>
>> 1. perf report showed significant time in futex_private_hash_put,
>> so I removed the RCU usage protecting users. That brought some
>> improvement, from 150k to 300k. Is there a better way to protect users?
>
> This is likely from the atomic dec operation itself. Then there is also
> the preemption counter operation. The inc should also be visible but
> might be inlined into the hash operation.
> It is _just_ the atomic inc/dec that doubled the "throughput", but you
> don't get anything from the regular path.
> Anyway, to avoid the atomic part we would need a per-CPU counter
> instead of a global one and a more expensive slow path for the resize,
> since you have to sum up all the per-CPU counters and so on. Not sure it
> is worth it.
>
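(To make the per-CPU counter idea concrete, here is a minimal user-space
analogue; a sketch only. The names pcpu_slot/users_get/users_sum are made
up for illustration, and in the kernel this would presumably be built on
the existing per-CPU counter infrastructure. Each CPU touches only its
own cache line on get/put, while the resize slow path pays the cost of
summing every slot.)

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS_MAX 64

/* One counter per CPU, padded to a cache line to avoid false sharing. */
struct pcpu_slot {
	_Atomic long count;
	char pad[64 - sizeof(_Atomic long)];
};

static struct pcpu_slot pcpu_users[NR_CPUS_MAX];

/* Fast path: each CPU updates only its own slot, no cacheline bouncing. */
static void users_get(int cpu)
{
	atomic_fetch_add_explicit(&pcpu_users[cpu].count, 1,
				  memory_order_relaxed);
}

static void users_put(int cpu)
{
	atomic_fetch_sub_explicit(&pcpu_users[cpu].count, 1,
				  memory_order_relaxed);
}

/* Slow path (resize): sum every per-CPU slot to get the total. */
static long users_sum(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS_MAX; cpu++)
		sum += atomic_load_explicit(&pcpu_users[cpu].count,
					    memory_order_relaxed);
	return sum;
}

int main(void)
{
	users_get(0);
	users_get(1);
	users_put(0);
	printf("users: %ld\n", users_sum()); /* prints 1 */
	return 0;
}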
The resize would happen only when one does prctl(), right? Or can it
happen automatically too?
fph is going to live on the thread leader's CPU, and using atomics on
fph->users would likely cause cache-line bouncing, no?
Not sure if this shows up only because this benchmark never actually
blocks. Maybe for real-life use-cases it doesn't matter.
>> 2. Since the number of buckets is smaller by default, this causes hash
>> bucket collisions, seen as time in queued_spin_lock_slowpath. I increased
>> the hash bucket count to what it was before the series, which brought the
>> numbers back to 1.5M. This could be achieved with prctl in
>> perf/bench/futex-hash.c, I guess.
>
> Yes. The idea is to avoid a resize at runtime and to set it to something
> you know works best. You can also use it now to disable the private hash
> and stick with the global one.
Yes, SET_SLOTS would take care of it.
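For reference, pinning the slot count from user space would look
something like this (a sketch; the constant values below are from my
reading of the series and guarded with #ifndef in case the local headers
define them differently):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78
#define PR_FUTEX_HASH_SET_SLOTS		1
#define PR_FUTEX_HASH_GET_SLOTS		2
#endif

int main(void)
{
	/*
	 * Pin the private hash to 16 slots (a power of two) so no
	 * runtime resize occurs; 0 should fall back to the global
	 * hash, per the discussion above.
	 */
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 16, 0))
		perror("PR_FUTEX_HASH_SET_SLOTS");

	printf("slots: %d\n",
	       (int)prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0));
	return 0;
}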
>
>> Note: Just increasing the hash bucket size, without point 1, didn't matter much.
>
> Sebastian