Message-ID: <aFEz_Fzr-_-nGAHV@mozart.vkv.me>
Date: Tue, 17 Jun 2025 02:23:08 -0700
From: Calvin Owens <calvin@...nvd.org>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
	"Lai, Yi" <yi1.lai@...ux.intel.com>,
	"Peter Zijlstra (Intel)" <peterz@...radead.org>, x86@...nel.org
Subject: Re: [tip: locking/urgent] futex: Allow to resize the private local
 hash

On Tuesday 06/17 at 09:16 +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-16 10:14:24 [-0700], Calvin Owens wrote:
> > On Wednesday 06/11 at 14:39 -0000, tip-bot2 for Sebastian Andrzej Siewior wrote:
> > > <snip> 
> > > It is possible that two threads simultaneously request the global hash
> > > and both pass the initial check and block later on the
> > > mm::futex_hash_lock. In this case the first thread performs the switch
> > > to the global hash. The second thread will also attempt to switch to the
> > > global hash and, while doing so, accesses the nonexistent slot 1 of
> > > struct futex_private_hash.
> > 
> > In case it's interesting to anyone, I'm hitting this one in real life,
> > one of my build machines got stuck overnight:
> 
> The scenario described in the description is not something that happens
> on its own. The bot explicitly "asked" for it. This won't happen in a
> "normal" scenario where you do not explicitly ask for specific hash via
> the prctl() interface.

Ugh, I'm sorry, I was in too much of a hurry this morning... cargo is
obviously not calling PR_FUTEX_HASH, which is new in 6.16 :/

> > Jun 16 02:51:34 beethoven kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
> > Jun 16 02:51:34 beethoven kernel: rcu:         16-....: (59997 ticks this GP) idle=eaf4/1/0x4000000000000000 softirq=14417247/14470115 fqs=21169
> > Jun 16 02:51:34 beethoven kernel: rcu:         (t=60000 jiffies g=21453525 q=663214 ncpus=24)
> > Jun 16 02:51:34 beethoven kernel: CPU: 16 UID: 1000 PID: 2028199 Comm: cargo Not tainted 6.16.0-rc1-lto-00236-g8c6bc74c7f89 #1 PREEMPT 
> > Jun 16 02:51:34 beethoven kernel: Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
> > Jun 16 02:51:34 beethoven kernel: RIP: 0010:queued_spin_lock_slowpath+0x162/0x1d0
> > Jun 16 02:51:34 beethoven kernel: Code: 0f 1f 84 00 00 00 00 00 f3 90 83 7a 08 00 74 f8 48 8b 32 48 85 f6 74 09 0f 0d 0e eb 0d 31 f6 eb 09 31 f6 eb 05 0f 1f 00 f3 90 <8b> 07 66 85 c0 75 f7 39 c8 75 13 41 b8 01 00 00 00 89 c8 f0 44 0f
> …
> > Jun 16 02:51:34 beethoven kernel: Call Trace:
> > Jun 16 02:51:34 beethoven kernel:  <TASK>
> > Jun 16 02:51:34 beethoven kernel:  __futex_pivot_hash+0x1f8/0x2e0
> > Jun 16 02:51:34 beethoven kernel:  futex_hash+0x95/0xe0
> > Jun 16 02:51:34 beethoven kernel:  futex_wait_setup+0x7e/0x230
> > Jun 16 02:51:34 beethoven kernel:  __futex_wait+0x66/0x130
> > Jun 16 02:51:34 beethoven kernel:  ? __futex_wake_mark+0xc0/0xc0
> > Jun 16 02:51:34 beethoven kernel:  futex_wait+0xee/0x180
> > Jun 16 02:51:34 beethoven kernel:  ? hrtimer_setup_sleeper_on_stack+0xe0/0xe0
> > Jun 16 02:51:34 beethoven kernel:  do_futex+0x86/0x120
> > Jun 16 02:51:34 beethoven kernel:  __se_sys_futex+0x16d/0x1e0
> > Jun 16 02:51:34 beethoven kernel:  do_syscall_64+0x47/0x170
> > Jun 16 02:51:34 beethoven kernel:  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> …
> > <repeats forever until I wake up and kill the machine>
> > 
> > It seems like this is well understood already, but let me know if
> > there's any debug info I can send that might be useful.
> 
> This is with LTO enabled.

Full lto with llvm-20.1.7.

> Based on the backtrace: there was a resize request (probably because a
> thread was created) and the resize was delayed because the hash was in
> use. The hash was released and now this thread moves all enqueued users
> from the old hash to the new. RIP says it is stuck on a spin lock,
> which is either the new or the old hash bucket lock.
> If this livelocks, then someone else must hold it locked without
> releasing it.
> Is this the only thread stuck or is there more?
> I'm puzzled here. It looks as if there was an unlock missing.

Nothing showed up in the logs except the RCU stalls on CPU16, always in
queued_spin_lock_slowpath().

I'll run the build it was doing when it happened in a loop overnight and
see if I can trigger it again.

> > Thanks,
> > Calvin
> 
> Sebastian
