Message-ID: <20250710080859.LsLwRH_j@linutronix.de>
Date: Thu, 10 Jul 2025 10:08:59 +0200
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org,
André Almeida <andrealmeid@...lia.com>,
Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Valentin Schneider <vschneid@...hat.com>,
Waiman Long <longman@...hat.com>
Subject: Re: [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU
related changes
On 2025-07-08 10:01:42 [+0200], To Peter Zijlstra wrote:
> > So the auto scaling thing is 'broken' in that if you do a 'final'
> > pthread_create() it will try and stage this new hash. If for whatever
> > reason the refcount isn't '0' -- and this can already happen today due
> > to a concurrent futex operation. Nothing will re-try the rehash.
>
> Sure it was there but not in the way the test case was setup. I *think*
> it is okay because in real life the new hash will be put to use unless
> you terminate shortly after at which point you don't need it.
>
> > This RCU business made this all but a certainty, but the race was there.
> >
> > I briefly wondered if we should have a workqueue re-try the rehash.
>
> I don't think we need to worry about it.
Just to replace the *think* arguments with some technical foundation: I
added a few trace printks to see how things are going.
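The printks were roughly of this shape (a reconstruction to illustrate
how the data below was gathered, not the literal debug patch; `cur' is
assumed to point to the currently installed hash):

	trace_printk("futex_hash_allocate: mm %p current %p pending %p\n",
		     mm, cur, mm->futex_phash_new);

With those in place, this came out: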
| zstd-2308004 [012] ..... 01.857262: __futex_pivot_hash: mm ffff88854c3f9b80 fph ffff88854ed3f000 (16)
The first thread is created and the hash is assigned for the first time.
| zstd-2308004 [012] ..... 01.858722: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
| zstd-2308004 [012] ..... 01.858760: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
| zstd-2308004 [012] ..... 01.858792: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
| zstd-2308004 [012] ..... 01.858830: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
| zstd-2308004 [012] ..... 01.858865: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
| zstd-2308004 [012] ..... 01.858913: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
| zstd-2308004 [012] ..... 01.858963: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
| zstd-2308004 [012] ..... 01.859019: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
| zstd-2308004 [012] ..... 01.859057: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
| zstd-2308004 [012] ..... 01.859092: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
| zstd-2308004 [012] ..... 01.859125: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
| zstd-2308004 [012] ..... 01.859160: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
| zstd-2308004 [012] ..... 01.859202: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff8885b65c8000
One thread after another is created. Each creation ends up allocating a
new futex_private_hash because the current mask is smaller than what we
want to have. We then acquire mm_struct::futex_hash_lock, notice that
mm_struct::futex_phash_new is already set ("pending" in the trace) and
free the newly allocated one.
As an optimization we could check mm_struct::futex_phash_new under the
lock first and skip the allocation if the fph that is already staged is
large enough. We have to acquire the lock for the assignment anyway, so
we could skip the allocation in this case; see the sketch after this
paragraph. As you can see, the pending address changed a few times: as
the number of threads increased, the size increased three times as well.
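A minimal sketch of that idea, assuming a made-up helper
futex_alloc_fph() for the actual allocation and simplified freeing and
error handling; the real code in kernel/futex/core.c is structured
differently:

	/*
	 * Sketch: peek at the pending hash under futex_hash_lock before
	 * allocating a replacement, and only allocate if it is too small.
	 */
	static int futex_hash_allocate(struct mm_struct *mm, unsigned int hash_mask)
	{
		struct futex_private_hash *new;

		mutex_lock(&mm->futex_hash_lock);
		new = mm->futex_phash_new;
		if (new && new->hash_mask >= hash_mask) {
			/* The pending hash is already big enough: nothing to do. */
			mutex_unlock(&mm->futex_hash_lock);
			return 0;
		}
		mutex_unlock(&mm->futex_hash_lock);

		new = futex_alloc_fph(hash_mask);	/* made-up helper */
		if (!new)
			return -ENOMEM;

		mutex_lock(&mm->futex_hash_lock);
		if (mm->futex_phash_new && mm->futex_phash_new->hash_mask >= hash_mask) {
			/* Someone staged a big enough hash meanwhile. */
			mutex_unlock(&mm->futex_hash_lock);
			kvfree(new);
			return 0;
		}
		/* Replace a smaller pending hash (freeing simplified here). */
		kvfree(mm->futex_phash_new);
		mm->futex_phash_new = new;
		mutex_unlock(&mm->futex_hash_lock);
		return 0;
	}

This would keep the behaviour observed above while saving the
allocation and the immediate free in the common case.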
| <idle>-0 [012] ..s1. 01.890423: futex_ref_rcu: mm ffff88854c3f9b80 [0] ffff88854ed3f000
| <idle>-0 [012] ..s1. 01.926421: futex_ref_rcu: mm ffff88854c3f9b80 [1] ffff88854ed3f000
The state transition, two times, because setting up the threads went so
quickly.
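My understanding of the mechanism behind the [0]/[1] pair, sketched:
two chained RCU callbacks, one to flip the refcounting mode of the
current hash and one after which the initial reference may be dropped.
The rcu_head placement, the state names and futex_ref_drop() are
illustrative, not necessarily the real identifiers:

	static void futex_ref_rcu(struct rcu_head *head)
	{
		struct mm_struct *mm = container_of(head, struct mm_struct,
						    futex_rcu);
		struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash);

		if (fph->state == FR_PERCPU) {
			/* [0]: switch from per-CPU to atomic refcounting. */
			smp_store_release(&fph->state, FR_ATOMIC);
			call_rcu(&mm->futex_rcu, futex_ref_rcu);
			return;
		}
		/* [1]: a grace period has passed, drop the initial reference. */
		futex_ref_drop(fph);
	}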
| zstd-2308007 [031] ..... 02.724004: __futex_pivot_hash: mm ffff88854c3f9b80 fph ffff8885b65c8000 (128)
Almost a second later, the first futex op leads to the assignment of the
hash; a sketch of the pivot follows below.
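The pivot itself, roughly (a sketch; futex_rehash_private() is an
assumed helper that moves the queued waiters over, and locking details
are omitted):

	/*
	 * Sketch: called with mm->futex_hash_lock held once the pending
	 * hash may replace the current one.
	 */
	static void __futex_pivot_hash(struct mm_struct *mm,
				       struct futex_private_hash *new)
	{
		struct futex_private_hash *old;

		old = rcu_dereference_protected(mm->futex_phash,
					lockdep_is_held(&mm->futex_hash_lock));
		if (old)
			futex_rehash_private(old, new);
		rcu_assign_pointer(mm->futex_phash, new);
		mm->futex_phash_new = NULL;
	}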
| zstd-2308004 [012] ..... 41.162795: futex_hash_free: mm ffff88854c3f9b80 fph ffff8885b65c8000 new 0000000000000000
Almost 40 seconds later, the application is done.
This is an example of the "let me set up the threads, I need them for a
while" kind.
A different one:
| lz4-2309026 [008] ..... 2.654404: __futex_pivot_hash: mm ffff888664977380 fph ffff8885b2000800 (16)
First assignment.
| lz4-2309026 [008] ..... 2.654597: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
| lz4-2309026 [008] ..... 2.654641: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
| lz4-2309026 [008] ..... 2.654690: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
| lz4-2309026 [008] ..... 2.654732: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
| lz4-2309026 [008] ..... 2.654774: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
| lz4-2309026 [008] ..... 2.654812: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
| lz4-2309026 [008] ..... 2.654847: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
| lz4-2309026 [008] ..... 2.654884: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
| lz4-2309026 [008] ..... 2.654919: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
| lz4-2309026 [008] ..... 2.654962: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
| lz4-2309026 [008] ..... 2.655002: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
| lz4-2309026 [008] ..... 2.655039: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
| lz4-2309026 [008] ..... 2.655202: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
| lz4-2309026 [008] ..... 2.655250: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
| lz4-2309026 [008] ..... 2.655288: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
| lz4-2309026 [008] ..... 2.655329: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
| lz4-2309026 [008] ..... 2.655370: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
| lz4-2309026 [008] ..... 2.655411: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
| lz4-2309026 [008] ..... 2.655449: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
| lz4-2309026 [008] ..... 2.655488: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
| lz4-2309026 [008] ..... 2.655525: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
| lz4-2309026 [008] ..... 2.655562: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
| lz4-2309026 [008] ..... 2.655603: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
A bunch of threads is created.
| <idle>-0 [008] ..s1. 2.689306: futex_ref_rcu: mm ffff888664977380 [0] ffff8885b2000800
| <idle>-0 [008] ..s1. 2.725324: futex_ref_rcu: mm ffff888664977380 [1] ffff8885b2000800
The state transition is over.
|kworker/8:38-2307372 [008] ..... 2.727104: futex_hash_free: mm ffff888664977380 fph ffff8885b2000800 new ffff8885ac4b4000
The hash is released from a kworker due to mmput_async(). It frees both
the current hash and the new one, which was never assigned; a sketch of
that teardown follows below.
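That teardown presumably boils down to something like this (a sketch;
futex_private_hash_put() stands in for however the references are
actually dropped):

	void futex_hash_free(struct mm_struct *mm)
	{
		struct futex_private_hash *fph;

		/* A staged hash that was never pivoted in: just free it. */
		kvfree(mm->futex_phash_new);

		fph = rcu_dereference_raw(mm->futex_phash);
		if (fph)
			futex_private_hash_put(fph);
	}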
I have a few examples similar to the lz4 one where the new hash was not
assigned either, because the application terminated early or because it
did not invoke any futex opcodes after that.
Therefore it does not make sense to eagerly assign the new hash as soon
as that becomes possible and to mobilize a kworker for it. Avoiding the
alloc/free dance, on the other hand, would make sense.
Sebastian