Message-ID: <7c660d7a-6c70-4307-895f-70d4aa274886@linux.ibm.com>
Date: Wed, 26 Mar 2025 18:24:37 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: André Almeida <andrealmeid@...lia.com>,
Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Valentin Schneider <vschneid@...hat.com>,
Waiman Long <longman@...hat.com>, linux-kernel@...r.kernel.org,
"Nysal Jan K.A." <nysal@...ux.ibm.com>
Subject: Re: [PATCH v10 00/21] futex: Add support task local hash maps,
FUTEX2_NUMA and FUTEX2_MPOL
On 3/26/25 15:01, Sebastian Andrzej Siewior wrote:
> On 2025-03-26 00:34:23 [+0530], Shrikanth Hegde wrote:
>> Hi Sebastian.
> Hi Shrikanth,
>
Hi.
>> So, I did some more benchmarking using the same perf futex hash benchmark.
>> I see that perf creates N threads, binds each thread to a CPU, and then
>> calls futex_wait such that it never blocks; it always returns EWOULDBLOCK.
>> Only futex_hash is exercised.
>
> It also does spin_lock() + unlock on the hash bucket. Without the
> locking, you would have constant numbers.
>
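(Side note for anyone reading along: this is roughly the pattern the
benchmark exercises. A minimal sketch, not the actual perf bench code:
FUTEX_WAIT with an expected value that never matches the futex word, so
the kernel does the hash lookup, takes and releases the bucket lock, and
returns EWOULDBLOCK without ever blocking.)

#include <errno.h>
#include <linux/futex.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	unsigned int futex_word = 0;

	/*
	 * Expected value 1 while the word is 0: the wait can never
	 * succeed, so the syscall only exercises the hash lookup and
	 * bucket lock/unlock path and fails with EWOULDBLOCK.
	 */
	if (syscall(SYS_futex, &futex_word, FUTEX_WAIT_PRIVATE, 1,
		    NULL, NULL, 0) == -1 && errno == EWOULDBLOCK)
		printf("futex_wait returned EWOULDBLOCK, as expected\n");
	return 0;
}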
Thanks for the explanation.
Also, given how perf binds the threads, it brings all the SMT threads up, and the
1-thread case probably gets the benefit of SMT folding. So beyond 40 threads the
baseline numbers barely change.
>> Numbers with different thread counts (private futexes):
>>
>> threads    baseline    with series    ratio
>>       1    3386265     3266560        0.96
>>      10    1972069      821565        0.41
>>      40    1580497      277900        0.17
>>      80    1555482      150450        0.096
>>
>>
>> With shared futexes (-s option):
>>
>> threads    baseline    with series    ratio
>>      80     590144      585067        0.99
>
> The shared numbers are equal since the code path there is unchanged.
>
>> After looking into the code, and after some hacking, I could get the
>> performance back with the change below. It is likely not functionally
>> correct. The reasons for the change:
>>
>> 1. perf report showed significant time in futex_private_hash_put,
>> so I removed the RCU usage protecting users. That brought some
>> improvement, from 150k to 300k. Is there a better way to protect users?
>
> This is likely from the atomic dec operation itself. Then there is also
> the preemption counter operation. The inc should also be visible but
> might be inlined into the hash operation.
> It is _just_ the atomic inc/dec that doubled the "throughput", but you
> don't get anything from the regular path.
> Anyway, to avoid the atomic part we would need a per-CPU counter
> instead of a global one and a more expensive slow path for the resize,
> since you have to sum up all the per-CPU counters and so on. Not sure it
> is worth it.
>
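(To make the per-CPU counter idea concrete, here is a minimal user-space
analogue; a sketch only. The names pcpu_slot/users_get/users_sum are made
up for illustration, and in the kernel this would presumably be built on
the existing per-CPU counter infrastructure. Each CPU touches only its
own cache line on get/put, while the resize slow path pays the cost of
summing every slot.)

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS_MAX 64

/* One counter per CPU, padded to a cache line to avoid false sharing. */
struct pcpu_slot {
	_Atomic long count;
	char pad[64 - sizeof(_Atomic long)];
};

static struct pcpu_slot pcpu_users[NR_CPUS_MAX];

/* Fast path: each CPU updates only its own slot, no cacheline bouncing. */
static void users_get(int cpu)
{
	atomic_fetch_add_explicit(&pcpu_users[cpu].count, 1,
				  memory_order_relaxed);
}

static void users_put(int cpu)
{
	atomic_fetch_sub_explicit(&pcpu_users[cpu].count, 1,
				  memory_order_relaxed);
}

/* Slow path (resize): sum every per-CPU slot to get the total. */
static long users_sum(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS_MAX; cpu++)
		sum += atomic_load_explicit(&pcpu_users[cpu].count,
					    memory_order_relaxed);
	return sum;
}

int main(void)
{
	users_get(0);
	users_get(1);
	users_put(0);
	printf("users: %ld\n", users_sum()); /* prints 1 */
	return 0;
}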
The resize would happen only when one does prctl(), right? Or can it
happen automatically too?
fph is going to live on the thread leader's CPU, and using atomics on
fph->users would likely cause cache-line bouncing, no?
Not sure if this shows up only because this benchmark never actually
blocks. Maybe for real-life use-cases it doesn't matter.
>> 2. Since the number of buckets is smaller by default, this causes hash
>> bucket collisions, seen as time in queued_spin_lock_slowpath. I increased
>> the hash bucket count to what it was before the series, which brought the
>> numbers back to 1.5M. This could be achieved with prctl in
>> perf/bench/futex-hash.c, I guess.
>
> Yes. The idea is to avoid a resize at runtime and to set it to something
> you know works best. You can also use it now to disable the private hash
> and stick with the global one.
Yes, SET_SLOTS would take care of it.
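For reference, pinning the slot count from user space would look
something like this (a sketch; the constant values below are from my
reading of the series and guarded with #ifndef in case the local headers
define them differently):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78
#define PR_FUTEX_HASH_SET_SLOTS		1
#define PR_FUTEX_HASH_GET_SLOTS		2
#endif

int main(void)
{
	/*
	 * Pin the private hash to 16 slots (a power of two) so no
	 * runtime resize occurs; 0 should fall back to the global
	 * hash, per the discussion above.
	 */
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 16, 0))
		perror("PR_FUTEX_HASH_SET_SLOTS");

	printf("slots: %d\n",
	       (int)prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0));
	return 0;
}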
>
>> Note: Just increasing the hash bucket size, without point 1, didn't matter much.
>
> Sebastian