Message-ID: <aa6154d1-726c-4da1-a27b-69d2e8b449c6@meta.com>
Date: Thu, 5 Jun 2025 20:55:27 -0400
From: Chris Mason <clm@...a.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org
Subject: Re: futex performance regression from "futex: Allow automatic
allocation of process wide futex hash"
On 6/4/25 4:08 PM, Sebastian Andrzej Siewior wrote:
> On 2025-06-04 11:48:05 [-0400], Chris Mason wrote:
>>> --- a/kernel/futex/core.c
>>> +++ b/kernel/futex/core.c
>>> @@ -1687,7 +1687,7 @@ int futex_hash_allocate_default(void)
>>> scoped_guard(rcu) {
>>> threads = min_t(unsigned int,
>>> get_nr_threads(current),
>>> - num_online_cpus());
>>> + num_online_cpus() * 2);
>>>
>>> fph = rcu_dereference(current->mm->futex_phash);
>>> if (fph) {
>>>
>>> which would increase it to 2048 as Chris asks for.
>>
>> I haven't followed these changes, so asking some extra questions. This
>> would bump to num_online_cpus() * 2, which probably isn't 2048 right?
>
> Assuming you have 256 CPUs, -m 4 -t 256 creates 4 * 256 = 1024 threads, so
> threads = 512 gets computed (= min(1024, 256 * 2)). Later we do 512 * 4
> (roundup_pow_of_two(4 * threads)). This makes 2048 hash buckets.
>
Ah right, I missed the x4.
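Writing the arithmetic out for my own benefit (just a sketch of the
sizing logic being discussed, not the kernel code itself):

/*
 * Illustrative only: mirrors the sizing math from the hunk above with
 * the -m 4 -t 256 example on a 256 CPU box.
 */
#include <stdio.h>

static unsigned int roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	unsigned int nr_threads = 4 * 256;	/* -m 4 -t 256 */
	unsigned int online_cpus = 256;
	unsigned int threads, buckets;

	/* proposed change: cap at 2x the online CPUs instead of 1x */
	threads = nr_threads < online_cpus * 2 ?
			nr_threads : online_cpus * 2;
	buckets = roundup_pow_of_two(4 * threads);

	/* prints threads=512 buckets=2048 */
	printf("threads=%u buckets=%u\n", threads, buckets);
	return 0;
}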
>> We've got large systems that are basically dedicated to single
>> workloads, and those will probably miss the larger global hash table,
>> regressing like schbench did. Then we have large systems spread over
>> multiple big workloads that will love the private tables.
>>
>> In either case, I think growing the hash table as a multiple of thread
>> count instead of cpu count will probably better reflect the crazy things
>> multi-threaded applications do? At any rate, I don't think we want
>> applications to need prctl to get back to the performance they had on
>> older kernels.
>
> This is only an issue if all your CPUs spend their time in the kernel
> using the hash buckets at the same time.
> This was the case in every benchmark I've seen so far. Your thing might
> be closer to an actual workload.
>
I didn't spend a ton of time looking at the perf profiles of the slower
kernel. Was the bottleneck in the hash chain length or in contention for
the buckets?
>> For people that want to avoid that memory overhead, I'm assuming they
>> want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should
>> make that more clear.
>>
>>> Then there the possibility of
> …
>>> 256 cores, 2xNUMA:
>>> | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1
>>> | average rps: 785 446.07 Futex HBs: 1024 immutable: 0
>>> | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1
>>> | average rps:   736 769.77 Futex HBs: 2048 immutable: 0
>>> | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1
>>
>>
>> How long are these runs? That's a huge benefit from being immutable
>> (1.5M vs 736K?) but the hash table churn should be confined to early in
>> the schbench run right?
>
> I think 30 secs or so. I used your command line.
Ah ok, my command line runs for 60 seconds. It feels strange that the
immutable flag makes it that much faster? schbench starts all the
threads up front, so it should hit steady state pretty quickly. More on
NUMA below, but I'll benchmark with the immutable flag on the turin box
in the morning to see if it is the extra atomics.
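If it does turn out to be a shared line bouncing around, something like
(assuming perf c2c works on that box):

perf c2c record -- ./schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0
perf c2c report

should show it as a hot contended cacheline.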
> The 256 core box showed
> a higher improvement than the 144 one. I attached a patch against
> schbench in the previous mail, I did then
> ./schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 -H 1024 -I
>
> …
>> This schbench hunk is just testing the performance impact of different
>> bucket sizes, but hopefully we don't need it long term unless we want to
>> play with even bigger hash tables?
>
> If you do "-H 0" then you should get the "old" behaviour. However the hash
> buckets are now spread over the NUMA nodes:
>
> | dmesg |grep -i futex
> | [ 0.501736] futex hash table entries: 32768 (2097152 bytes on 2 NUMA nodes, total 4096 KiB, linear).
>
> Now there are 32768 hash buckets on both NUMA nodes. Depending on the
> hash it computes, it uses the data structures on NUMA node 1 or 2. The
> old code allocated 65536 hash buckets via vmalloc().
So I wanted to mention this earlier and forgot, but schbench obviously
isn't numa aware at all. This combination of command line options
basically just has each of the 4 message threads waking up its 256
worker threads after scribbling a timestamp into shared memory. From a
schbench pov at least, we'll get much more stable numbers by sticking
to a single socket. <makes numa placement hand gestures>
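To be explicit about what I mean, something along the lines of
(untested, assuming numactl is handy on that box):

numactl --cpunodebind=0 --membind=0 ./schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0

would keep both the threads and the memory they're hammering on a
single socket.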
>
> The bigger hash table isn't always the answer. Yes, you could play
> around and figure out what works best for you. The problem is that the hash
> is based on the mm and the (user) memory address. So on each run you
> will get a different hash and therefore different collisions.
> If you end up having many hash collisions and then block on the same
> lock then yes, a larger hash table will be the cure. If you have many
> threads doing the futex syscall simultaneously then making the hash
> immutable avoids two atomic ops on the same memory address.
> This would be my favorite.
>
> Now that I think about it, it might be possible to move the atomic ops to
> the hash bucket itself. Then it wouldn't bounce the cache line so much.
> Making the hash immutable is simpler.
Going back to your diff, if we have a process growing the total number
of threads, can we set FH_IMMUTABLE too early? As the number of threads
increases, eventually we'll pick the 2x num_cpus, but that'll take a while?
I do see what you mean about immutable being simpler though, I'll get
some numbers on those atomics.
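For the record, my mental model of the two atomic ops you mean is the
reference get/put around each futex op, roughly (just a sketch so you
can correct me, not the actual kernel code; the names and layout are
made up, only the shape of the atomics is what I'm after):

#include <stdatomic.h>
#include <stdbool.h>

struct phash {
	atomic_uint users;	/* shared counter that bounces between CPUs */
	bool immutable;
};

static void phash_get(struct phash *ph)
{
	if (ph->immutable)
		return;		/* hash can never be replaced, skip the refcount */
	atomic_fetch_add_explicit(&ph->users, 1, memory_order_acquire);
}

static void phash_put(struct phash *ph)
{
	if (ph->immutable)
		return;
	/* the matching dec hits the same cache line as the inc above */
	atomic_fetch_sub_explicit(&ph->users, 1, memory_order_release);
}

So with hundreds of threads in the futex syscall at once, every op pays
for two cross-CPU atomics on that one counter unless the hash is
immutable. If that's right, moving it into the buckets as you suggest
would at least spread the contention across the table.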
-chris