Message-ID: <20250626131715.GI1613200@noisy.programming.kicks-ass.net>
Date: Thu, 26 Jun 2025 15:17:15 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Chris Mason <clm@...a.com>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>
Subject: Re: futex performance regression from "futex: Allow automatic
 allocation of process wide futex hash"

On Thu, Jun 26, 2025 at 07:01:23AM -0400, Chris Mason wrote:
> On 6/24/25 3:01 PM, Peter Zijlstra wrote:
> > On Fri, Jun 06, 2025 at 09:06:38AM +0200, Sebastian Andrzej Siewior wrote:
> >> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote:
> >>>>> We've got large systems that are basically dedicated to single
> >>>>> workloads, and those will probably miss the larger global hash table,
> >>>>> regressing like schbench did.  Then we have large systems spread over
> >>>>> multiple big workloads that will love the private tables.
> >>>>>
> >>>>> In either case, I think growing the hash table as a multiple of thread
> >>>>> count instead of cpu count will probably better reflect the crazy things
> >>>>> multi-threaded applications do?  At any rate, I don't think we want
> >>>>> applications to need prctl to get back to the performance they had on
> >>>>> older kernels.
> >>>>
> >>>> This is only an issue if all your CPUs spend their time in the kernel
> >>>> using the hash buckets at the same time.
> >>>> This was the case in every benchmark I've seen so far. Your thing might
> >>>> be closer to an actual workload.
> >>>>
> >>>
> >>> I didn't spend a ton of time looking at the perf profiles of the slower
> >>> kernel, was the bottleneck in the hash chain length or in contention for
> >>> the buckets?
> >>
> >> Every futex operation does a rcuref_get() (which is an atomic inc) on
> >> the private hash. This is before anything else happens. If you have two
> >> threads, on two CPUs, which simultaneously do a futex() operation then
> >> both do this rcuref_get(). That atomic inc ensures that the cacheline
> >> bounces from one CPU to the other. On the exit of the syscall there is a
> >> matching rcuref_put().
> > 
> > How about something like this (very lightly tested)...
> > 
> > the TL;DR is that it turns all those refcounts into per-cpu ops when
> > there is no hash replacement pending (eg. the normal case), and only
> > folds the lot into an atomic when we really care about it.
> > 
> > There's some sharp corners still.. but it boots and survives the
> > (slightly modified) selftest.
> 
> I can get some benchmarks going of this, thanks.  For 6.16, is the goal
> to put something like this in, or default to the global hash table until
> we've nailed it down?
> 
> I'd vote for defaulting to global for one more release.

Probably best to do that; means we don't have to rush crazy code :-)
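
For readers following along: the idea being discussed is a refcount that is per-CPU on the fast path (so a plain CPU-local increment, no cacheline bouncing between CPUs) and only folds into a single shared atomic when a hash replacement is pending. Below is a rough userspace sketch of that scheme; all names (`pcpu_ref`, `ref_get`, `ref_fold`, etc.) are invented for illustration, this is not the actual patch, and a real implementation needs an RCU-style grace period between setting the "dying" flag and folding, which is elided here.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical per-CPU refcount: fast path touches only this CPU's
 * cacheline; slow path (replacement pending) uses one shared atomic. */
struct pcpu_ref {
	/* pad each counter to a cacheline to avoid false sharing */
	struct { long cnt; char pad[64 - sizeof(long)]; } percpu[NR_CPUS];
	atomic_long atomic_cnt;
	atomic_bool dying;	/* set when a hash replacement is pending */
};

static void ref_get(struct pcpu_ref *r, int cpu)
{
	if (!atomic_load(&r->dying))
		r->percpu[cpu].cnt++;		/* CPU-local, no bounce */
	else
		atomic_fetch_add(&r->atomic_cnt, 1);
}

static void ref_put(struct pcpu_ref *r, int cpu)
{
	if (!atomic_load(&r->dying))
		r->percpu[cpu].cnt--;
	else
		atomic_fetch_sub(&r->atomic_cnt, 1);
}

/* Fold the per-CPU counters into the shared atomic and return the
 * total; after this all gets/puts go through the atomic, so the
 * replacer can wait for it to drop to zero. (A real version must
 * wait out in-flight fast-path ops before summing.) */
static long ref_fold(struct pcpu_ref *r)
{
	long sum = 0;

	atomic_store(&r->dying, true);
	for (int c = 0; c < NR_CPUS; c++)
		sum += r->percpu[c].cnt;
	return atomic_fetch_add(&r->atomic_cnt, sum) + sum;
}
```

The point of the shape above is that in the common case (no replacement pending) two threads on two CPUs doing futex() simultaneously touch different cachelines, which is exactly the bounce the rcuref_get()/rcuref_put() pair in the current code cannot avoid.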
