Message-ID: <aK0B8IYKIH1IHyDj@google.com>
Date: Mon, 25 Aug 2025 17:38:08 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: linux-kernel@...r.kernel.org
Cc: linux-tip-commits@...r.kernel.org,
"Peter Zijlstra (Intel)" <peterz@...radead.org>, Sebastian Andrzej Siewior <bigeasy@...utronix.de>, x86@...nel.org
Subject: Re: [tip: locking/futex] futex: Use RCU-based per-CPU reference
counting instead of rcuref_t
On Fri, Aug 15, 2025, Sean Christopherson wrote:
> On Fri, Jul 11, 2025, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the locking/futex branch of tip:
> >
> > Commit-ID: 56180dd20c19e5b0fa34822997a9ac66b517e7b3
> > Gitweb: https://git.kernel.org/tip/56180dd20c19e5b0fa34822997a9ac66b517e7b3
> > Author: Peter Zijlstra <peterz@...radead.org>
> > AuthorDate: Thu, 10 Jul 2025 13:00:07 +02:00
> > Committer: Peter Zijlstra <peterz@...radead.org>
> > CommitterDate: Fri, 11 Jul 2025 16:02:00 +02:00
> >
> > futex: Use RCU-based per-CPU reference counting instead of rcuref_t
> >
> > The use of rcuref_t for reference counting introduces a performance bottleneck
> > when accessed concurrently by multiple threads during futex operations.
> >
> > Replace rcuref_t with specially crafted per-CPU reference counters. The
> > lifetime logic remains the same.
> >
> > The newly allocated private hash starts in FR_PERCPU state. In this state, each
> > futex operation that requires the private hash uses a per-CPU counter (an
> > unsigned int) for incrementing or decrementing the reference count.
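
(Purely for illustration, a minimal sketch of that fast path with made-up
names, not the actual patch: in FR_PERCPU state taking or dropping a
reference is a plain per-CPU increment/decrement under RCU; anything else
falls back to the shared atomic_t in the mm.)

  #include <linux/atomic.h>
  #include <linux/percpu.h>
  #include <linux/rcupdate.h>

  enum fph_state { FR_PERCPU, FR_ATOMIC };

  struct fph_sketch {
      enum fph_state state;          /* FR_PERCPU or FR_ATOMIC */
      unsigned int __percpu *ref;    /* per-CPU refs, used while FR_PERCPU */
      atomic_t *fallback;            /* stand-in for mm->futex_atomic */
      struct rcu_head rcu;           /* for the deferred state transition */
  };

  static void fph_sketch_get(struct fph_sketch *fph)
  {
      rcu_read_lock();
      if (READ_ONCE(fph->state) == FR_PERCPU)
          this_cpu_inc(*fph->ref);
      else
          atomic_inc(fph->fallback);
      rcu_read_unlock();
  }

  static void fph_sketch_put(struct fph_sketch *fph)
  {
      rcu_read_lock();
      if (READ_ONCE(fph->state) == FR_PERCPU) {
          this_cpu_dec(*fph->ref);
      } else if (atomic_dec_and_test(fph->fallback)) {
          /* last user gone: the next futex op can install the new hash */
      }
      rcu_read_unlock();
  }
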
> >
> > When the private hash is about to be replaced, the per-CPU counters are
> > migrated to an atomic_t counter, mm_struct::futex_atomic.
> > The migration process:
> > - Wait for one RCU grace period to ensure all users observe the
> > current private hash. This can be skipped if a grace period has elapsed
> > since the private hash was assigned.
> >
> > - futex_private_hash::state is set to FR_ATOMIC, forcing all users to
> > use mm_struct::futex_atomic for reference counting.
> >
> > - After an RCU grace period, all users are guaranteed to be using the
> > atomic counter. The per-CPU counters can now be summed up and added to
> > the atomic_t counter. If the resulting count is zero, the hash can be
> > safely replaced. Otherwise, active users still hold a valid reference.
> >
> > - Once the atomic reference count drops to zero, the next futex
> > operation will switch to the new private hash.
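
(Continuing that sketch, still with hypothetical names: the switch-over
amounts to flipping the state, waiting out grace periods, and folding the
per-CPU counts into the atomic_t. It is shown as a blocking sequence here
for clarity; the real code defers the steps via RCU callbacks instead.)

  #include <linux/cpumask.h>

  /*
   * Fold the per-CPU counts into the shared atomic_t. Only safe once a
   * grace period has passed after setting FR_ATOMIC, i.e. once nobody can
   * still be touching the per-CPU counters.
   */
  static void fph_sketch_fold(struct fph_sketch *fph)
  {
      unsigned int sum = 0;
      int cpu;

      for_each_possible_cpu(cpu)
          sum += per_cpu(*fph->ref, cpu);

      /*
       * If this ends up at zero there are no users left and the hash can
       * be replaced right away; otherwise the final put takes care of it.
       */
      atomic_add(sum, fph->fallback);
  }

  static void fph_sketch_make_atomic(struct fph_sketch *fph)
  {
      /* 1) make sure everyone observes the current private hash */
      synchronize_rcu();

      /* 2) force all new get/put onto the atomic counter */
      WRITE_ONCE(fph->state, FR_ATOMIC);

      /* 3) after another grace period the per-CPU counters are idle */
      synchronize_rcu();

      fph_sketch_fold(fph);
  }
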
> >
> > call_rcu_hurry() is used to speed up the transition, which otherwise might
> > be delayed with RCU_LAZY. There is nothing wrong with using call_rcu(). The
> > side effect would be that with auto scaling the new hash is used later and
> > the SET_SLOTS prctl() will block longer.
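
(The asynchronous flavour of the same step, again with hypothetical names:
the fold from the previous sketch runs out of an RCU callback, and
call_rcu_hurry() merely keeps a lazy RCU configuration from sitting on
that callback.)

  static void fph_sketch_fold_cb(struct rcu_head *head)
  {
      /* a grace period has passed since FR_ATOMIC was set */
      fph_sketch_fold(container_of(head, struct fph_sketch, rcu));
  }

  static void fph_sketch_start_transition(struct fph_sketch *fph)
  {
      WRITE_ONCE(fph->state, FR_ATOMIC);
      /*
       * Plain call_rcu() would be correct as well; the _hurry variant just
       * avoids the extra delay RCU_LAZY could add before the new hash
       * becomes usable.
       */
      call_rcu_hurry(&fph->rcu, fph_sketch_fold_cb);
  }
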
> >
> > [bigeasy: commit description + mm get/ put_async]
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> > Link: https://lore.kernel.org/r/20250710110011.384614-3-bigeasy@linutronix.de
> > ---
>
> This is causing explosions on my test systems, in code that doesn't obviously
> have anything to do with futex.

Closing the loop, this turned out to be a KVM bug[*]. Why the futex changes
exposed the bug and caused explosions, I have no idea, but nothing suggests that
this patch is buggy.

[*] https://lore.kernel.org/all/20250825160406.ZVcVPStz@linutronix.de

> The most common symptom is a #GP on this code in try_to_wake_up():
>
> /* Link @node into the waitqueue. */
> WRITE_ONCE(prev->next, node);
>
> although on systems with 5-level paging I _think_ it just manifests as hard
> hangs (I assume because prev->next is corrupted, but is still canonical with
> LA57? But that's a wild guess).
>
> The failure always occurs when userspace writes /sys/module/kvm/parameters/nx_huge_pages,
> but I don't think there's anything KVM specific about the issue. Simply writing
> the param doesn't explode, the problem only arises when I'm running tests in
> parallel (but then failure is almost immediate), so presumably there's a task
> migration angle or something?
>
> Manually disabling CONFIG_FUTEX_PRIVATE_HASH makes the problem go away, and
> running with CONFIG_FUTEX_PRIVATE_HASH=y prior to this rework is also fine. So
> it appears that the problem is specifically in the new code.
>
> I can provide more info as needed next week.
>
> Oops: general protection fault, probably for non-canonical address 0xff0e899fa1566052: 0000 [#1] SMP
...
> Call Trace:
> <TASK>
> _raw_spin_lock_irqsave+0x50/0x60
> try_to_wake_up+0x4f/0x5d0
> set_nx_huge_pages+0xe4/0x1c0 [kvm]
> param_attr_store+0x89/0xf0
> module_attr_store+0x1e/0x30
> kernfs_fop_write_iter+0xe4/0x160
> vfs_write+0x2cb/0x420
> ksys_write+0x7f/0xf0
> do_syscall_64+0x6f/0x1f0
> ? arch_exit_to_user_mode_prepare+0x9/0x50
> entry_SYSCALL_64_after_hwframe+0x4b/0x53