Message-ID: <878qu5xjxz.ffs@tglx>
Date: Thu, 31 Oct 2024 00:14:16 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
    linux-kernel@...r.kernel.org, André Almeida <andrealmeid@...lia.com>,
    Darren Hart <dvhart@...radead.org>, Davidlohr Bueso <dave@...olabs.net>,
    Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
    Valentin Schneider <vschneid@...hat.com>, Waiman Long <longman@...hat.com>
Subject: Re: [RFC PATCH 2/3] futex: Add basic infrastructure for local task
local hash.
On Wed, Oct 30 2024 at 22:08, Peter Zijlstra wrote:
> On Mon, Oct 28, 2024 at 01:02:34PM +0100, Thomas Gleixner wrote:
>
>> That's what we did with the original series, but with this model it's
>> daft. What we maybe could do there is:
>
> Not sure what's daft -- a single JVM running on 400+ CPUs with 4
> hashbuckets sounds awesome.
Hahaha. You're right and I discussed this with Sebastian earlier
today. He's going to do some research on scaling of the hash vs. number
of threads (manually for now).
Vs. daft: The original 'register first' model was trivial in that
regard. "Automagic" requires way more to get it right.
The global hash size is next_power_of_2(256 * num_possible_cpus()). So
with 256 CPUs this ends up with 64k hash buckets (4M of memory).
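For reference, that rule is roughly what futex_init() does today. A
condensed sketch (CONFIG_BASE_SMALL and the alloc_large_system_hash()
details left out):

#include <linux/log2.h>
#include <linux/cpumask.h>

static unsigned long futex_hashsize __ro_after_init;

static void __init futex_global_hash_size(void)
{
        /* 256 buckets per possible CPU, rounded up to a power of two. */
        futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
        /* 256 CPUs -> 65536 buckets, ~64 bytes each -> ~4M of memory. */
}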
As one thread of a multithreaded application can only have one futex
queued at a time, it's reasonable to size that according to the number
of threads.
But from my memory of investigating JVM and database muck there can be a
gazillion of futexes in the process. So depending on the scaling factor
(number of buckets per thread) the resulting hash distribution might be
truly bad and cause a vast amount of hash collisions if the hash size is
too small. Even if the bucket lock is not the problem, then the list
walk will show up as a cache miss sh*tshow.
Let's see what Sebastian's experiments will unearth.
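Just to make the knob concrete, a purely illustrative per-thread sizing
rule (the scale factor and the clamp values are made up here, and where
the thread count comes from is exactly what's discussed below):

#include <linux/log2.h>
#include <linux/minmax.h>

/* Assumed values, for illustration only */
#define FUTEX_PRIVATE_HASH_SCALE	4		/* buckets per thread */
#define FUTEX_PRIVATE_HASH_MAX_SIZE	(1U << 16)	/* arbitrary cap */

static unsigned int futex_private_hash_size(unsigned int nr_threads)
{
        unsigned int buckets;

        buckets = roundup_pow_of_two(FUTEX_PRIVATE_HASH_SCALE * nr_threads);
        return clamp(buckets, 16U, FUTEX_PRIVATE_HASH_MAX_SIZE);
}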
>> private_hash()
>> scoped_guard(rcu) {
>> hash = rcu_dereference(current->signal->futex_hash);
>
> So I really do think mm_struct is a better place for this than signal
> struct -- CLONE_SIGHAND is not mandatory when CLONE_VM.
>
> I've long forgotten which JVM used the naked CLONE_VM, but there is some
> creative code out there.
>
> And futexes fundamentally live in the memory address space.
Agreed.
>> if (hash && rcuref_get(&hash->ref))
>> return hash;
>> }
>>
>> guard(spinlock_irq)(&task->sighand->siglock);
>> hash = current->signal->futex_hash;
>> if (hash && rcuref_get(&hash->ref))
>> return hash;
>> // Let alloc scale according to signal->nr_threads
>
> mm->mm_users
>
>> // alloc acquires a reference count
>> ....
>
> It might make sense to have a prctl() setting that inhibits the hash
> allocation entirely, reverting back to the global hash tables.
Right. It might even be the default to begin with. See below.
> I'm not sure having that rehash under siglock is a fine idea. It's
> convenient, no doubt, but urgh, could get expensive.
If we move it to mm, then it will be a separate lock anyway, a mutex.
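With the mm variant, the lookup/alloc path from the quoted sketch above
would end up looking roughly like this. Sketch only: the field and
helper names (mm->futex_hash, mm->futex_hash_mutex, futex_hash_alloc())
are made up for illustration and error handling is omitted:

static struct futex_hash *futex_private_hash(void)
{
        struct mm_struct *mm = current->mm;
        struct futex_hash *hash;

        /* Fast path: grab a reference on the published hash. */
        scoped_guard(rcu) {
                hash = rcu_dereference(mm->futex_hash);
                if (hash && rcuref_get(&hash->ref))
                        return hash;
        }

        guard(mutex)(&mm->futex_hash_mutex);

        /* Recheck: someone else might have allocated it meanwhile. */
        hash = rcu_dereference_protected(mm->futex_hash,
                                         lockdep_is_held(&mm->futex_hash_mutex));
        if (hash && rcuref_get(&hash->ref))
                return hash;

        /* Let alloc scale according to mm_users; returns with a reference held. */
        hash = futex_hash_alloc(atomic_read(&mm->mm_users));
        if (hash)
                rcu_assign_pointer(mm->futex_hash, hash);
        return hash;
}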
> Another scheme would be to have 2 concurrent hash-tables for a little
> while.
Thought about that before and discarded it as too complex, but now that
you mentioned it again, let me explain why and you can tell me that I'm
failing to see the obvious.
As that switch over needs to be lockless on the usage site, this opens
another can of worms.
struct mm_futex {
        struct futex_hash __rcu	*cur;
        struct futex_hash __rcu	*prev;
        struct mutex		mutex;
        ....
};
The reallocation part becomes obviously trivial:

        mutex_lock(&mm->futex.mutex);
        old = mm->futex.cur;
        new = alloc(...);
        rcu_assign_pointer(mm->futex.prev, old);
        rcu_assign_pointer(mm->futex.cur, new);
        if (rcuref_put(&old->ref)) {
                // old is invalid now
                rcu_assign_pointer(mm->futex.prev, NULL);
                kfree_rcu(old);
        }
        mutex_unlock(&mm->futex.mutex);
So enqueue will always use mm->futex.cur which it sees when getting a
refcount. If the get() fails it has to retry. Trivial enough.
wake/requeue gets more interesting because it has to check the old hash
first (if it exists) and move the entries in the hash bucket over,
otherwise we end up dealing with that nonsense in the actual
wait/requeue guts. That needs to be done at _every_ operation ...
That might actually be doable (or not), but it requires a non-trivial
amount of complexity, especially vs. the hb->waiters optimization.
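Roughly, every wake/requeue would have to start with something like the
below. Sketch only; futex_rehash_bucket() is a made-up placeholder for
the part which is anything but trivial:

static void futex_check_prev_hash(struct mm_struct *mm, union futex_key *key)
{
        struct futex_hash *prev, *cur;

        guard(rcu)();
        prev = rcu_dereference(mm->futex.prev);
        if (!prev)
                return;
        cur = rcu_dereference(mm->futex.cur);
        /*
         * Migrate any waiters hashed under the old table over to the
         * current one, including the hb->waiters accounting.
         */
        futex_rehash_bucket(prev, cur, key);
}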
If that's handled, then it falls flat when a futex is used as a wait
queue. There might be a thread sitting on it waiting for an event which
never happens, which means the old hash never gets removed. Then the
process gets another pile of threads and can't expand the hash because
there are already two instances.
Remember: Futex code consists of 5% functionality and 95% corner case
handling already. No need to up this to 98+% :)
I really want to start simple and let the process' futex usage come to a
grinding halt when the resizing and rehashing takes place.
Resizing won't happen every other millisecond especially as the hash
size has to double every time. And we'll never downsize.
Any smart application will resize on start or the admin will tweak the
hash size or disable it with whatever magic mechanism we will provide.
That said, for the sake of least surprise, this might actually need to
be opt-in to begin with. At least at the system level it has to be
disabled by default (unless configured otherwise via config or command
line).
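Whatever the mechanism ends up being, from the application side it
would boil down to a one-time opt-in, something like this (entirely
hypothetical prctl; the name, value and arguments are invented here):

#include <stdio.h>
#include <sys/prctl.h>

/* Made-up values, nothing like this exists (yet). */
#ifndef PR_FUTEX_PRIVATE_HASH
#define PR_FUTEX_PRIVATE_HASH		0x59420001
#define PR_FUTEX_PRIVATE_HASH_ENABLE	1
#define PR_FUTEX_PRIVATE_HASH_DISABLE	0
#endif

int main(void)
{
        /*
         * Opt the process in to a private futex hash before spawning
         * the thread herd, so the sizing happens once up front.
         */
        if (prctl(PR_FUTEX_PRIVATE_HASH, PR_FUTEX_PRIVATE_HASH_ENABLE, 0, 0, 0))
                perror("prctl");
        return 0;
}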
I really want to know who thought that futexes are a good idea to begin
with. Once we figured that out I need to find those who added all the
other complexity on top of that bad idea :)
Thanks,
tglx