Message-ID: <20250626143638.GM1613200@noisy.programming.kicks-ass.net>
Date: Thu, 26 Jun 2025 16:36:38 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Chris Mason <clm@...a.com>, linux-kernel@...r.kernel.org,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: futex performance regression from "futex: Allow automatic
allocation of process wide futex hash"
On Thu, Jun 26, 2025 at 03:48:20PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-24 21:01:18 [+0200], Peter Zijlstra wrote:
> > How about something like this (very lightly tested)...
> >
> > the TL;DR is that it turns all those refcounts into per-cpu ops when
> > there is no hash replacement pending (eg. the normal case), and only
> > folds the lot into an atomic when we really care about it.
>
> so we have per-CPU counters and on resize we wait one RCU grace period
> to ensure everybody observed the current fph, switch to atomics and
> wait one grace period to ensure everyone is using the atomic. The last
> step is to fold the per-CPU counters into the atomic counter and once
> the counter reaches 0 perform the swap.
>
> This looks fine :) Due to the RCU grace period, the swap takes longer
> than before.
Yes, but given this is supposed to be 'rare', I didn't think that was a
problem.
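
To make that two-step dance concrete, here is a rough sketch of what the
RCU callback could look like. Only futex_ref_rcu(), mm->futex_ref,
mm->futex_atomic, mm->futex_rcu and FR_PERCPU appear in the quoted hunks
below; FR_ATOMIC, the final-swap helper and the exact ordering are
assumptions here, not necessarily what the patch does:

static void futex_ref_rcu(struct rcu_head *head)
{
	struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu);
	struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash);
	long sum = 0;
	int cpu;

	if (fph->state == FR_PERCPU) {
		/*
		 * First grace period: everybody has observed the current
		 * fph. Flip to atomic mode and wait another grace period
		 * so every futex_ref_get()/put() sees the new state.
		 */
		smp_store_release(&fph->state, FR_ATOMIC);
		call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
		return;
	}

	/*
	 * Second grace period: everybody now uses mm->futex_atomic. Fold
	 * the per-CPU counts into it; the real thing needs a bias so a
	 * concurrent futex_ref_put() cannot observe a transient zero
	 * before the fold. If the folded count is zero the hash is unused
	 * and the pivot can happen.
	 */
	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(mm->futex_ref, cpu);

	if (atomic_long_add_return(sum, &mm->futex_atomic) == 0)
		futex_ref_dead(fph);	/* hypothetical "do the swap" hook */
}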
> I guess that is why you said earlier to use srcu. For
> things like "hackbench -T" you end up creating a new hash on every
> thread creation which is not applied because RCU takes a while.
> This could be optimized later by checking if the hash in futex_phash_new
> matches the requested size.
So the main benefit of using SRCU (I have half a patch and a headache
to show for it) is that you don't need the per-mm per-CPU memory
storage. The downside is that you share the 'refcount' between all mm's,
so it gets to be even slower.
The 'problem' is that reasoning about stuff becomes even harder. But the
main idea is to replace mm->futex_phash with a 'tombstone' value to
force __futex_hash() into taking the slow-path and hitting
mm->futex_hash_lock.
Then you do call_srcu() to wait one SRCU period; this has everybody
stalled and guarantees the fph you had is now unused, so we can rehash.
Then replace the tombstone with the new hash and start over.
It's just that stuff like futex_hash_get() gets somewhat tricky; you
have to ensure the SRCU periods overlap.
Anyway, that approach should be feasible as well, just not sure of the
trade-offs.
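
For illustration, a bare-bones version of that tombstone scheme could
look like the below. Everything with "futex_srcu" or "TOMBSTONE" in the
name is invented for the sketch, synchronize_srcu() stands in for the
call_srcu() flow described above, and it assumes __futex_hash() readers
sit inside srcu_read_lock(&futex_srcu):

/* Illustrative only; not from any posted patch. */
DEFINE_STATIC_SRCU(futex_srcu);
#define FUTEX_PHASH_TOMBSTONE	((struct futex_private_hash *)1)

static void futex_pivot_hash_srcu(struct mm_struct *mm,
				  struct futex_private_hash *new)
{
	struct futex_private_hash *old;

	guard(mutex)(&mm->futex_hash_lock);

	old = rcu_dereference_protected(mm->futex_phash,
					lockdep_is_held(&mm->futex_hash_lock));

	/*
	 * The tombstone forces __futex_hash() into the slow path, where
	 * it serializes on mm->futex_hash_lock until the pivot is done.
	 */
	rcu_assign_pointer(mm->futex_phash, FUTEX_PHASH_TOMBSTONE);

	/* One SRCU period later nobody can still be using 'old'. */
	synchronize_srcu(&futex_srcu);

	futex_rehash_private(old, new);
	rcu_assign_pointer(mm->futex_phash, new);
	kvfree_rcu(old, rcu);
}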
> > There are some sharp corners still... but it boots and survives the
> > (slightly modified) selftest.
>
> The refcount does not pop up in perf so that is good.
\o/
> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
> > fph = rcu_dereference_protected(mm->futex_phash,
> > lockdep_is_held(&mm->futex_hash_lock));
> > if (fph) {
> > - if (!rcuref_is_dead(&fph->users)) {
> > + if (!futex_ref_is_dead(fph)) {
> > mm->futex_phash_new = new;
> > return false;
> > }
> >
> > futex_rehash_private(fph, new);
> > }
> > - rcu_assign_pointer(mm->futex_phash, new);
> > + new->state = FR_PERCPU;
> > + scoped_guard (rcu) {
>
> Do we use a space or not? It looks like sched/ does, while the
> remaining bits of the kernel mostly don't. I don't care, but we could
> (later) adjust it for futex towards one direction.
Oh, scoped_guard is a for loop, and we write for loops with a space.
Perhaps it's because I wrote it all and know this. But yeah, meh :-)
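
FWIW, the original definition in include/linux/cleanup.h (the current
one has grown extra plumbing for conditional guards, but the shape is
unchanged) was, give or take, literally a for loop over the scope class:

#define scoped_guard(_name, args...)					\
	for (CLASS(_name, scope)(args),					\
	     *done = NULL; !done; done = (void *)1)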
> > + mm->futex_batches = get_state_synchronize_rcu();
> > + rcu_assign_pointer(mm->futex_phash, new);
> > + }
> > kvfree_rcu(fph, rcu);
> > return true;
> > }
> …
> > +static void futex_ref_drop(struct futex_private_hash *fph)
> …
> > + call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
>
> Do you think it would improve with srcu or it is not worth it?
Per the above, I'm not sure about the trade-offs. I think the SRCU
approach can be made to work -- I just got my head in a twist.
> > +}
> > +
> > +static bool futex_ref_get(struct futex_private_hash *fph)
> > +{
> > + struct mm_struct *mm = fph->mm;
> > +
> > + guard(preempt)();
> > +
> > + if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > + this_cpu_inc(*mm->futex_ref);
> > + return true;
> > + }
> > +
> > + return atomic_long_inc_not_zero(&mm->futex_atomic);
> > +}
> > +
> > +static bool futex_ref_put(struct futex_private_hash *fph)
> > +{
> > + struct mm_struct *mm = fph->mm;
> > +
> > + guard(preempt)();
> > +
> > + if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > + this_cpu_dec(*mm->futex_ref);
> > + return false;
> > + }
> > +
> > + return atomic_long_dec_and_test(&mm->futex_atomic);
> > +}
> > +
> > +static bool futex_ref_is_dead(struct futex_private_hash *fph)
> > +{
> > + struct mm_struct *mm = fph->mm;
> > +
> > + guard(preempt)();
> > +
> > + if (smp_load_acquire(&fph->state) == FR_PERCPU)
> > + return false;
> > +
> > + return atomic_long_read(&mm->futex_atomic) == 0;
> > +}
>
> Why preempt_disable()? Is it just an optimized version of
> rcu_read_lock()? I don't understand why. You don't even go for
> __this_cpu_inc() so I am a bit puzzled.
The code morphed a lot over the two days it took to write this. But
yeah, preempt or rcu doesn't really matter here.
I didn't yet think about __this_cpu; it's not really relevant on x86. If
you want to make that relaxation you have to consider IRQ and SoftIRQ
handling though. Given we have this_cpu_inc() in the RCU callback, which
is run from SoftIRQ context, this might not be safe.
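
For the record, the window that makes __this_cpu_inc() dodgy here looks
roughly like this (illustrative interleaving, not code from the patch):

	/* task context: preemption disabled, IRQs enabled */
	tmp = raw_cpu_read(*mm->futex_ref);	/* reads N */

		/* SoftIRQ hits here, futex_ref_rcu() does
		   this_cpu_inc(*mm->futex_ref): counter is N+1 */

	raw_cpu_write(*mm->futex_ref, tmp + 1);	/* stores N+1, the SoftIRQ
						   increment is lost */

this_cpu_inc() closes that window: a single insn on x86, a
local_irq_save() section on architectures that need it.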