Message-ID: <aK0B8IYKIH1IHyDj@google.com>
Date: Mon, 25 Aug 2025 17:38:08 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: linux-kernel@...r.kernel.org
Cc: linux-tip-commits@...r.kernel.org,
"Peter Zijlstra (Intel)" <peterz@...radead.org>, Sebastian Andrzej Siewior <bigeasy@...utronix.de>, x86@...nel.org
Subject: Re: [tip: locking/futex] futex: Use RCU-based per-CPU reference
counting instead of rcuref_t
On Fri, Aug 15, 2025, Sean Christopherson wrote:
> On Fri, Jul 11, 2025, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the locking/futex branch of tip:
> >
> > Commit-ID: 56180dd20c19e5b0fa34822997a9ac66b517e7b3
> > Gitweb: https://git.kernel.org/tip/56180dd20c19e5b0fa34822997a9ac66b517e7b3
> > Author: Peter Zijlstra <peterz@...radead.org>
> > AuthorDate: Thu, 10 Jul 2025 13:00:07 +02:00
> > Committer: Peter Zijlstra <peterz@...radead.org>
> > CommitterDate: Fri, 11 Jul 2025 16:02:00 +02:00
> >
> > futex: Use RCU-based per-CPU reference counting instead of rcuref_t
> >
> > The use of rcuref_t for reference counting introduces a performance bottleneck
> > when accessed concurrently by multiple threads during futex operations.
> >
> > Replace rcuref_t with specially crafted per-CPU reference counters. The
> > lifetime logic remains the same.
> >
> > The newly allocated private hash starts in FR_PERCPU state. In this state, each
> > futex operation that requires the private hash uses a per-CPU counter (an
> > unsigned int) for incrementing or decrementing the reference count.
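
(Purely for illustration, a minimal sketch of that fast path with made-up
names, not the actual patch: in FR_PERCPU state taking or dropping a
reference is a plain per-CPU increment/decrement under RCU; anything else
falls back to the shared atomic_t in the mm.)

  #include <linux/atomic.h>
  #include <linux/percpu.h>
  #include <linux/rcupdate.h>

  enum fph_state { FR_PERCPU, FR_ATOMIC };

  struct fph_sketch {
      enum fph_state state;          /* FR_PERCPU or FR_ATOMIC */
      unsigned int __percpu *ref;    /* per-CPU refs, used while FR_PERCPU */
      atomic_t *fallback;            /* stand-in for mm->futex_atomic */
      struct rcu_head rcu;           /* for the deferred state transition */
  };

  static void fph_sketch_get(struct fph_sketch *fph)
  {
      rcu_read_lock();
      if (READ_ONCE(fph->state) == FR_PERCPU)
          this_cpu_inc(*fph->ref);
      else
          atomic_inc(fph->fallback);
      rcu_read_unlock();
  }

  static void fph_sketch_put(struct fph_sketch *fph)
  {
      rcu_read_lock();
      if (READ_ONCE(fph->state) == FR_PERCPU) {
          this_cpu_dec(*fph->ref);
      } else if (atomic_dec_and_test(fph->fallback)) {
          /* last user gone: the next futex op can install the new hash */
      }
      rcu_read_unlock();
  }
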
> >
> > When the private hash is about to be replaced, the per-CPU counters are
> > migrated to an atomic_t counter, mm_struct::futex_atomic.
> > The migration process:
> > - Wait for one RCU grace period to ensure all users observe the
> > current private hash. This can be skipped if a grace period has elapsed
> > since the private hash was assigned.
> >
> > - futex_private_hash::state is set to FR_ATOMIC, forcing all users to
> > use mm_struct::futex_atomic for reference counting.
> >
> > - After an RCU grace period, all users are guaranteed to be using the
> > atomic counter. The per-CPU counters can now be summed up and added to
> > the atomic_t counter. If the resulting count is zero, the hash can be
> > safely replaced. Otherwise, active users still hold a valid reference.
> >
> > - Once the atomic reference count drops to zero, the next futex
> > operation will switch to the new private hash.
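
(Continuing that sketch, still with hypothetical names: the switch-over
amounts to flipping the state, waiting out grace periods, and folding the
per-CPU counts into the atomic_t. It is shown as a blocking sequence here
for clarity; the real code defers the steps via RCU callbacks instead.)

  #include <linux/cpumask.h>

  /*
   * Fold the per-CPU counts into the shared atomic_t. Only safe once a
   * grace period has passed after setting FR_ATOMIC, i.e. once nobody can
   * still be touching the per-CPU counters.
   */
  static void fph_sketch_fold(struct fph_sketch *fph)
  {
      unsigned int sum = 0;
      int cpu;

      for_each_possible_cpu(cpu)
          sum += per_cpu(*fph->ref, cpu);

      /*
       * If this ends up at zero there are no users left and the hash can
       * be replaced right away; otherwise the final put takes care of it.
       */
      atomic_add(sum, fph->fallback);
  }

  static void fph_sketch_make_atomic(struct fph_sketch *fph)
  {
      /* 1) make sure everyone observes the current private hash */
      synchronize_rcu();

      /* 2) force all new get/put onto the atomic counter */
      WRITE_ONCE(fph->state, FR_ATOMIC);

      /* 3) after another grace period the per-CPU counters are idle */
      synchronize_rcu();

      fph_sketch_fold(fph);
  }
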
> >
> > call_rcu_hurry() is used to speed up the transition, which otherwise might
> > be delayed with RCU_LAZY. There is nothing wrong with using call_rcu(). The
> > side effect would be that with auto scaling the new hash is used later and
> > the SET_SLOTS prctl() will block longer.
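
(The asynchronous flavour of the same step, again with hypothetical names:
the fold from the previous sketch runs out of an RCU callback, and
call_rcu_hurry() merely keeps a lazy RCU configuration from sitting on
that callback.)

  static void fph_sketch_fold_cb(struct rcu_head *head)
  {
      /* a grace period has passed since FR_ATOMIC was set */
      fph_sketch_fold(container_of(head, struct fph_sketch, rcu));
  }

  static void fph_sketch_start_transition(struct fph_sketch *fph)
  {
      WRITE_ONCE(fph->state, FR_ATOMIC);
      /*
       * Plain call_rcu() would be correct as well; the _hurry variant just
       * avoids the extra delay RCU_LAZY could add before the new hash
       * becomes usable.
       */
      call_rcu_hurry(&fph->rcu, fph_sketch_fold_cb);
  }
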
> >
> > [bigeasy: commit description + mm get/ put_async]
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> > Link: https://lore.kernel.org/r/20250710110011.384614-3-bigeasy@linutronix.de
> > ---
>
> This is causing explosions on my test systems, in code that doesn't obviously
> have anything to do with futex.

Closing the loop, this turned out to be a KVM bug[*]. Why the futex changes
exposed the bug and caused explosions, I have no idea, but nothing suggests that
this patch is buggy.

[*] https://lore.kernel.org/all/20250825160406.ZVcVPStz@linutronix.de

> The most common symptom is a #GP on this code in try_to_wake_up():
>
> /* Link @node into the waitqueue. */
> WRITE_ONCE(prev->next, node);
>
> although on systems with 5-level paging I _think_ it just manifests as hard
> hangs (I assume because prev->next is corrupted, but is still canonical with
> LA57? But that's a wild guess).
>
> The failure always occurs when userspace writes /sys/module/kvm/parameters/nx_huge_pages,
> but I don't think there's anything KVM specific about the issue. Simply writing
> the param doesn't explode, the problem only arises when I'm running tests in
> parallel (but then failure is almost immediate), so presumably there's a task
> migration angle or something?
>
> Manually disabling CONFIG_FUTEX_PRIVATE_HASH makes the problem go away, and
> running with CONFIG_FUTEX_PRIVATE_HASH=y prior to this rework is also fine. So
> it appears that the problem is specifically in the new code.
>
> I can provide more info as needed next week.
>
> Oops: general protection fault, probably for non-canonical address 0xff0e899fa1566052: 0000 [#1] SMP
...
> Call Trace:
> <TASK>
> _raw_spin_lock_irqsave+0x50/0x60
> try_to_wake_up+0x4f/0x5d0
> set_nx_huge_pages+0xe4/0x1c0 [kvm]
> param_attr_store+0x89/0xf0
> module_attr_store+0x1e/0x30
> kernfs_fop_write_iter+0xe4/0x160
> vfs_write+0x2cb/0x420
> ksys_write+0x7f/0xf0
> do_syscall_64+0x6f/0x1f0
> ? arch_exit_to_user_mode_prepare+0x9/0x50
> entry_SYSCALL_64_after_hwframe+0x4b/0x53