Message-ID: <20250604200808.hqaWJdCo@linutronix.de>
Date: Wed, 4 Jun 2025 22:08:08 +0200
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Chris Mason <clm@...a.com>
Cc: Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org
Subject: Re: futex performance regression from "futex: Allow automatic
allocation of process wide futex hash"
On 2025-06-04 11:48:05 [-0400], Chris Mason wrote:
> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -1687,7 +1687,7 @@ int futex_hash_allocate_default(void)
> > scoped_guard(rcu) {
> > threads = min_t(unsigned int,
> > get_nr_threads(current),
> > - num_online_cpus());
> > + num_online_cpus() * 2);
> >
> > fph = rcu_dereference(current->mm->futex_phash);
> > if (fph) {
> >
> > which would increase it to 2048 as Chris asks for.
>
> I haven't followed these changes, so asking some extra questions. This
> would bump to num_online_cpus() * 2, which probably isn't 2048 right?
Assuming you have 256 CPUs, -m 4 -t 256 creates 4 * 256 = 1024 threads, so
threads = 512 gets computed (= min(1024, 256 * 2)). Later we do
roundup_pow_of_two(4 * threads), i.e. 4 * 512, which makes 2048 hash buckets.
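For reference, the same arithmetic as a tiny standalone userspace sketch
(example numbers hard-coded, not the kernel code itself):

#include <stdio.h>

/* round up to the next power of two (simplified, for the example only) */
static unsigned int roundup_pow_of_two(unsigned int n)
{
	unsigned int r = 1;

	while (r < n)
		r <<= 1;
	return r;
}

int main(void)
{
	unsigned int online_cpus = 256;		/* example machine */
	unsigned int nr_threads = 4 * 256;	/* schbench -m 4 -t 256 */
	unsigned int threads, buckets;

	threads = nr_threads < online_cpus * 2 ? nr_threads : online_cpus * 2;
	buckets = roundup_pow_of_two(4 * threads);

	printf("threads=%u buckets=%u\n", threads, buckets);	/* 512, 2048 */
	return 0;
}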
> We've got large systems that are basically dedicated to single
> workloads, and those will probably miss the larger global hash table,
> regressing like schbench did. Then we have large systems spread over
> multiple big workloads that will love the private tables.
>
> In either case, I think growing the hash table as a multiple of thread
> count instead of cpu count will probably better reflect the crazy things
> multi-threaded applications do? At any rate, I don't think we want
> applications to need prctl to get back to the performance they had on
> older kernels.
This is only an issue if all your CPUs spend their time in the kernel
using the hash buckets at the same time.
This was the case in every benchmark I've seen so far. Your setup might
be closer to an actual workload.
> For people that want to avoid that memory overhead, I'm assuming they
> want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should
> make that more clear.
>
> > Then there the possibility of
…
> > 256 cores, 2xNUMA:
> > | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1
> > | average rps: 785 446.07 Futex HBs: 1024 immutable: 0
> > | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1
> > | average rps: 736 769.77 Futex HBs: 2048 immutable: 0
> > | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1
>
>
> How long are these runs? That's a huge benefit from being immutable
> (1.5M vs 736K?) but the hash table churn should be confined to early in
> the schbench run right?
I think 30 secs or so. I used your command line. The 256-core box showed
a bigger improvement than the 144-core one. I attached a patch against
schbench to the previous mail and then ran
./schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 -H 1024 -I
…
> This schbench hunk is just testing the performance impact of different
> bucket sizes, but hopefully we don't need it long term unless we want to
> play with even bigger hash tables?
If you do "-H 0" then you should get the "old" behaviour. However the hash
bucket are spread now over the NUMA nodes:
| dmesg |grep -i futex
| [ 0.501736] futex hash table entries: 32768 (2097152 bytes on 2 NUMA nodes, total 4096 KiB, linear).
Now there are 32768 hash buckets on both NUMA nodes. Depending on the
hash it computes, it uses the data structures on one NUMA node or the
other. The old code allocated 65536 hash buckets via vmalloc().
The bigger hash table isn't always the answer. Yes, you could play
around and figure out what works best for you. The problem is that the
hash is based on the mm and the (user) memory address, so on each run
you will get a different hash and therefore different collisions.
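Roughly like this (a made-up mix just to illustrate which inputs matter,
not the kernel's actual hash function):

/* Illustration only: the bucket is picked from the mm pointer and the
 * user-space address of the futex word. Both change from run to run
 * (new mm, ASLR), so the same application lands in different buckets
 * and sees different collisions on every run. */
static unsigned int example_futex_bucket(unsigned long mm_ptr,
					 unsigned long uaddr,
					 unsigned int nr_buckets)
{
	unsigned long v = mm_ptr ^ (uaddr >> 2);

	v *= 0x9e3779b97f4a7c15UL;	/* cheap integer mix */
	/* nr_buckets is a power of two */
	return (unsigned int)(v >> 32) & (nr_buckets - 1);
}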
If you end up having many hash collisions and then blocking on the same
lock, then yes, a larger hash table will be the cure. If you have many
threads doing the futex syscall simultaneously, then making the hash
immutable avoids two atomic ops on the same memory address.
This would be my favorite.
Now that I think about it, it might be possible to move the atomic ops to
the hash bucket itself. Then they wouldn't bounce the cache line so much.
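Something along these lines (userspace sketch of the idea only, not the
kernel data structures):

#include <stdatomic.h>

#define NR_BUCKETS	2048

/* Instead of one reference count covering the whole private hash
 * (which every futex op on every CPU increments and decrements),
 * each bucket would carry its own count, so concurrent ops that hash
 * to different buckets no longer pound the same cache line. */
struct bucket {
	atomic_uint	users;
	/* spinlock, list of waiters, ... */
};

static struct bucket buckets[NR_BUCKETS];

static void bucket_get(unsigned int b)
{
	atomic_fetch_add(&buckets[b].users, 1);
}

static void bucket_put(unsigned int b)
{
	atomic_fetch_sub(&buckets[b].users, 1);
}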
Making the hash immutable is simpler.
> -chris
Sebastian