Message-ID: <8b89a9a8-7114-452e-bf7c-86f0cedbe01d@redhat.com>
Date: Tue, 16 Sep 2025 09:27:20 -0400
From: Waiman Long <llong@...hat.com>
To: pengyu <pengyu@...inos.cn>, peterz@...radead.org, mingo@...hat.com,
will@...nel.org, boqun.feng@...il.com
Cc: linux-kernel@...r.kernel.org
Subject: Re: [PATCH] locking/qspinlock: use xchg with _mb in slowpath for
arm64
On 9/15/25 11:39 PM, pengyu wrote:
> From: Yu Peng <pengyu@...inos.cn>
>
> A hard lockup was detected on arm64: rq->lock had been released, but a
> CPU remained blocked on mcs_node->locked and eventually timed out.
>
> We found that xchg_tail and atomic_try_cmpxchg_relaxed use the _relaxed
> versions without memory barriers. We suspect insufficient ordering
> guarantees on some arm64 microarchitectures, potentially leading to the
> following scenario:
>
> CPU0:                                  CPU1:
> // set tail to CPU0
> old = xchg_tail(lock, tail);
>
> // CPU0 reads the tail as itself
> if ((val & _Q_TAIL_MASK) == tail)
>                                        // CPU1 exchanges the tail
>                                        old = xchg_tail(lock, tail);
> // assuming CPU0 does not see the tail change
> atomic_try_cmpxchg_relaxed(
>     &lock->val, &val, _Q_LOCKED_VAL)
> // released without notifying CPU1
> goto release;
>                                        // hard lockup detected
>                                        arch_mcs_spin_lock_contended(
>                                            &node->locked)
>
> Therefore, replace the _relaxed versions of xchg_tail and
> atomic_try_cmpxchg with the _mb (fully ordered) versions.
>
> Signed-off-by: pengyu <pengyu@...inos.cn>
The qspinlock code has been enabled on arm64 for quite a long time, and
this is the first report of its kind that we have received. How
reproducible is this hang?

Which arm64 microarchitecture shows this problem? It could be a hardware
bug.

In any case, changing a relaxed atomic op to a fully ordered version can
be expensive on arm64 in general. We need more information to be sure
that we are doing the right thing.
Cheers,
Longman