Message-ID: <f971b01a-2d76-4511-8880-0f4de2a042d9@redhat.com>
Date: Tue, 16 Sep 2025 12:58:54 -0400
From: Waiman Long <llong@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>, pengyu <pengyu@...inos.cn>
Cc: mingo@...hat.com, will@...nel.org, boqun.feng@...il.com,
linux-kernel@...r.kernel.org, Mark Rutland <mark.rutland@....com>,
t.haas@...bs.de, parri.andrea@...il.com, j.alglave@....ac.uk,
luc.maranget@...ia.fr, paulmck@...nel.org, jonas.oberhauser@...weicloud.com,
r.maseli@...bs.de, lkmm@...ts.linux.dev, stern@...land.harvard.edu
Subject: Re: [PATCH] locking/qspinlock: use xchg with _mb in slowpath for arm64
On 9/16/25 10:10 AM, Peter Zijlstra wrote:
> On Tue, Sep 16, 2025 at 11:39:03AM +0800, pengyu wrote:
>> From: Yu Peng <pengyu@...inos.cn>
>>
>> A hardlock was detected on arm64: rq->lock was released, but a CPU
>> stayed blocked at mcs_node->locked and timed out.
>>
>> We found that xchg_tail and atomic_try_cmpxchg_relaxed use _relaxed
>> versions without memory barriers. We suspect insufficient coherence
>> guarantees on some arm64 microarchitectures, potentially leading to
>> the following sequence:
>>
>> CPU0:                                   CPU1:
>> // Set tail to CPU0
>> old = xchg_tail(lock, tail);
>>
>> //CPU0 read tail is itself
>> if ((val & _Q_TAIL_MASK) == tail)
>>                                         // CPU1 exchanges the tail
>>                                         old = xchg_tail(lock, tail)
>> //assuming CPU0 not see tail change
>> atomic_try_cmpxchg_relaxed(
>>       &lock->val, &val, _Q_LOCKED_VAL)
>> //released without notifying CPU1
>> goto release;
>>                                         //hardlock detected
>>                                         arch_mcs_spin_lock_contended(
>>                                               &node->locked)
>>
>> Therefore, make xchg_tail and atomic_try_cmpxchg use the _mb variants
>> instead of _relaxed.
> Yeah, no. We do not apply patches based on suspicion. And we most
> certainly do not sprinkle #ifdef ARM64 in generic code.
>
> There is this thread:
>
> https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de
Ah, I was not cc'ed on this email thread. That is why I was not aware of
this discussion about xchg_tail(). It is an interesting read.

Anyway, this particular problem may be about the clarity of the arm64
memory model and whether every microarchitecture strictly follows it or not.

Cheers,
Longman
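
For readers following the quoted CPU0/CPU1 scenario, here is a minimal
userspace sketch in C11 atomics, not kernel code. It replays that
interleaving single-threaded to show what a coherent machine is expected to
do: once CPU1's tail exchange has landed, CPU0's compare-and-exchange on the
same word fails, so CPU0 cannot take the "goto release" path. All names and
constants below (Q_TAIL_OFFSET, TAIL(), head_try_release(), replay()) are
hypothetical stand-ins, and the tail exchange is modeled as a plain
compare-and-exchange loop on one 32-bit word, which is a simplifying
assumption about the real layout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed, simplified lock-word layout: bit 0 = locked, bits 16..31 = tail. */
#define Q_LOCKED_VAL  1u
#define Q_TAIL_OFFSET 16
#define Q_TAIL_MASK   (~0u << Q_TAIL_OFFSET)
#define TAIL(cpu)     (((uint32_t)(cpu) + 1u) << Q_TAIL_OFFSET)

/* Model of xchg_tail(): install a new tail, keep the low bits, return the old word. */
static uint32_t xchg_tail_relaxed(_Atomic uint32_t *lock, uint32_t tail)
{
    uint32_t old = atomic_load_explicit(lock, memory_order_relaxed);

    /* On failure, 'old' is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak_explicit(
               lock, &old, (old & ~Q_TAIL_MASK) | tail,
               memory_order_relaxed, memory_order_relaxed))
        ;
    return old;
}

/*
 * Queue head's last step. Returns true if it claimed the lock with no
 * successor (the "goto release" path), false if it must hand off instead.
 */
static bool head_try_release(_Atomic uint32_t *lock, uint32_t val, uint32_t my_tail)
{
    if ((val & Q_TAIL_MASK) != my_tail)
        return false;
    return atomic_compare_exchange_strong_explicit(
        lock, &val, Q_LOCKED_VAL,
        memory_order_relaxed, memory_order_relaxed);
}

/* Replay the quoted interleaving; optionally let CPU1 enqueue in the window. */
bool replay(bool cpu1_enqueues)
{
    _Atomic uint32_t lock = 0;

    /* CPU0: set tail to itself, then read the lock word. */
    xchg_tail_relaxed(&lock, TAIL(0));
    uint32_t val = atomic_load_explicit(&lock, memory_order_relaxed);

    if (cpu1_enqueues) {
        /* CPU1: exchanges the tail after CPU0 sampled 'val'. */
        uint32_t old1 = xchg_tail_relaxed(&lock, TAIL(1));
        assert((old1 & Q_TAIL_MASK) == TAIL(0)); /* CPU1 sees CPU0 as predecessor */
    }

    /* CPU0: tail check on the stale 'val', then the relaxed cmpxchg. */
    return head_try_release(&lock, val, TAIL(0));
}
```

Here replay(false) returns true (uncontended release) and replay(true)
returns false, because the compare-and-exchange operates on the current
value of the word regardless of memory ordering. The quoted failure would
therefore need the two read-modify-write operations to disagree about the
same location, which is a coherence question rather than a barrier question.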