Message-ID: <71ea52f2-f6bf-4a55-84ba-d1442d13bc82@meta.com>
Date: Thu, 26 Jun 2025 07:01:23 -0400
From: Chris Mason <clm@...a.com>
To: Peter Zijlstra <peterz@...radead.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>
Subject: Re: futex performance regression from "futex: Allow automatic
allocation of process wide futex hash"
On 6/24/25 3:01 PM, Peter Zijlstra wrote:
> On Fri, Jun 06, 2025 at 09:06:38AM +0200, Sebastian Andrzej Siewior wrote:
>> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote:
>>>>> We've got large systems that are basically dedicated to single
>>>>> workloads, and those will probably miss the larger global hash table,
>>>>> regressing like schbench did. Then we have large systems spread over
>>>>> multiple big workloads that will love the private tables.
>>>>>
>>>>> In either case, I think growing the hash table as a multiple of thread
>>>>> count instead of cpu count will probably better reflect the crazy things
>>>>> multi-threaded applications do? At any rate, I don't think we want
>>>>> applications to need prctl to get back to the performance they had on
>>>>> older kernels.
>>>>
>>>> This is only an issue if all your CPUs spend their time in the kernel
>>>> using the hash buckets at the same time.
>>>> This was the case in every benchmark I've seen so far. Your thing might
>>>> be closer to an actual workload.
>>>>
>>>
>>> I didn't spend a ton of time looking at the perf profiles of the slower
>>> kernel; was the bottleneck in the hash chain length or in contention for
>>> the buckets?
>>
>> Every futex operation does a rcuref_get() (which is an atomic inc) on
>> the private hash. This is before anything else happens. If you have two
>> threads, on two CPUs, which simultaneously do a futex() operation then
>> both do this rcuref_get(). That atomic inc ensures that the cacheline
>> bounces from one CPU to the other. On the exit of the syscall there is a
>> matching rcuref_put().
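
[For illustration, a minimal sketch of the pattern Sebastian describes, with
hypothetical names rather than the actual futex code: every futex() entry
takes a reference on the process-wide private hash and drops it on exit, and
because that refcount is a single atomic shared by all threads, concurrent
futex ops on different CPUs keep pulling the same cacheline back and forth.]

struct futex_private_hash_sketch {
	atomic_t	users;		/* shared by every thread: the bouncing cacheline */
	unsigned int	hash_mask;
	/* ... hash buckets follow ... */
};

/* taken on entry to every futex() syscall */
static struct futex_private_hash_sketch *
fph_get(struct futex_private_hash_sketch *fph)
{
	atomic_inc(&fph->users);	/* atomic RMW: line goes exclusive on this CPU */
	return fph;
}

/* dropped again on syscall exit */
static void fph_put(struct futex_private_hash_sketch *fph)
{
	atomic_dec(&fph->users);	/* and bounces back on the other CPU's exit */
}
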
>
> How about something like this (very lightly tested)...
>
> the TL;DR is that it turns all those refcounts into per-cpu ops when
> there is no hash replacement pending (e.g. the normal case), and only
> folds the lot into an atomic when we really care about it.
>
> There's some sharp corners still.. but it boots and survives the
> (slightly modified) selftest.
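
[A minimal sketch of the idea Peter describes, with hypothetical names and not
taken from his patch: in the common case each CPU bumps its own counter, so
there is no shared cacheline to bounce; only while a hash replacement is
pending does the count fall back to a single atomic so the replacer can drain
it. The mode switch itself would need an RCU grace period before folding the
per-cpu counts, much like percpu_ref; that step is glossed over here.]

struct fph_ref_sketch {
	unsigned int __percpu	*pcpu_count;	/* fast path: CPU-local increments */
	atomic_t		count;		/* slow path: single shared atomic */
	bool			atomic_mode;	/* set while a replacement is pending */
};

static void fph_ref_get(struct fph_ref_sketch *ref)
{
	rcu_read_lock();
	if (likely(!READ_ONCE(ref->atomic_mode)))
		this_cpu_inc(*ref->pcpu_count);	/* no cross-CPU cacheline traffic */
	else
		atomic_inc(&ref->count);	/* replacement pending: fall back */
	rcu_read_unlock();
}

static void fph_ref_put(struct fph_ref_sketch *ref)
{
	rcu_read_lock();
	if (likely(!READ_ONCE(ref->atomic_mode)))
		this_cpu_dec(*ref->pcpu_count);
	else
		atomic_dec(&ref->count);
	rcu_read_unlock();
}
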
I can get some benchmarks going on this, thanks. For 6.16, is the goal
to put something like this in, or default to the global hash table until
we've nailed it down?

I'd vote for defaulting to global for one more release.

-chris