linux-kernel - Re: [PATCH] locking/qspinlock: use xchg with

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8a713303-65f7-47e3-b2e6-6262f044d728@kylinos.cn>
Date: Fri, 19 Sep 2025 18:22:23 +0800
From: pengyu <pengyu@...inos.cn>
To: Will Deacon <will@...nel.org>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
 boqun.feng@...il.com, longman@...hat.com, linux-kernel@...r.kernel.org,
 Mark Rutland <mark.rutland@....com>, t.haas@...bs.de,
 parri.andrea@...il.com, j.alglave@....ac.uk, luc.maranget@...ia.fr,
 paulmck@...nel.org, jonas.oberhauser@...weicloud.com, r.maseli@...bs.de,
 lkmm@...ts.linux.dev, stern@...land.harvard.edu
Subject: Re: [PATCH] locking/qspinlock: use xchg with _mb in slowpath for
 arm64

On 2025/9/17 19:37, Will Deacon wrote:
> On Wed, Sep 17, 2025 at 06:51:18PM +0800, pengyu wrote:
>> Yes, this issue occurred on a kunpeng920 96-core machine and only
>> affected a small number of systems that had been running for over a
>> year.
>>
>> Vmcore Analysis:
>> • Panic triggered by CPU 83 detecting a hard lockup at
>>      queued_spin_lock_slowpath+0x1d8/0x320.
>>
>> • Corresponding code:
>>      arch_mcs_spin_lock_contended(&node->locked);
>>
>> • The qspinlock involved was the rq lock, which showed a cleared state:
>>      crash> rq.lock,cpu ffffad96ff2907c0
>>        lock = {
>>          raw_lock = {
>>            {
>>              val = {
>>                counter = 0
>>              },
>>              {
>>                locked = 0 '\000',
>>                pending = 0 '\000'
>>              },
>>              {
>>                locked_pending = 0,
>>                tail = 0
>>              }
>>            }
>>          }
>>        },
>>        cpu = 50,
>>
>> • CPU 83’s MCS node remained in a locked=0 state, with no previous
>> node found in the qnodes list.
>>      crash> p qnodes:83
>>      per_cpu(qnodes, 83) = $292 =
>>       {{
>>          mcs = {
>>            next = 0x0,
>>            locked = 0,
>>            count = 1
>>          }
>>        },
>>      crash> p qnodes | grep 83
>>        [83]: ffffadd6bf7914c0
>>      crash> p qnodes:all | grep ffffadd6bf7914c0
>>      crash>
>>
>> • Since rq->lock was cleared, no CPU could notify CPU 83.
>>
>> This issue has occurred multiple times, but the root cause remains
>> unclear. We suspect that CPU 83 may have failed to enqueue itself,
>> potentially due to a failure in the xchg_tail atomic operation.
> 
> Hmm. For the lock word to be clear with a CPU spinning on its MCS node
> then something has gone quite badly wrong. I think that would mean that:
> 
>    1. The spinning CPU has updated tail to point to its node (xchg_tail())
>    2. The lock-owning CPU then erroneously cleared the tail field
>       (atomic_try_cmpxchg_relaxed())
> 
> But for the cmpxchg() to succeed in (2) then the xchg() in (1) must be
> ordered after it and the lock word wouldn't end up as zero. This is
> because RmW atomics must be totally ordered for a given memory location
> and that applies regardless of their memory ordering properties.
> 
> Of course, there could be _a_ bug here but, given the information you've
> been able to provide, it's not obviously as "simple" as a missing memory
> barrier. Have you confirmed that adding memory barriers makes the problem
> go away?
> 

Could this mean that even with xchg's relaxed version, cmpxchg would be
impossible to succeed after xchg?

We're unsure of the exact cause, only speculating that memory barriers
might control CPU Store Buffer flushing so that other CPUs can see the
modifications immediately，but our understanding of this is quite limited.

Moreover, each test cycle is lengthy, making it difficult to confirm
whether adding memory barriers resolves the issue.

> If you're able to check the thread_info (via sp_el0) of CPU83 in your
> example, it might be interesting to see whether or not the 'cpu' field
> has been corrupted. For example, if it ends up being read as -1 then we
> may compute a tail of 0 when enqueuing our MCS node into the lock word.
> 

I checked the code for xchg_tail:
   lsr     w1, w2, #16
   add     x7, x19, #0x2
   swph    w1, w0, [x7]

x19 is the address of the rq lock, and tail is stored in w2. The dumped
registers are:
   x19: ffffad96ff2907c0
   x2 : 0000000001500000

So CPU 83 should calculate tail as 0x1500000, not 0.

>> It has been noted that the _relaxed version is used in xchx_tail, and we
>> are uncertain whether this could lead to visibility issues—for example,
>> if CPU 83 modifies lock->tail, but other CPUs fail to observe the
>> change.
>>
>> We are also checking if this is related：
>>      https://lkml.kernel.org/r/cb83e3e4-9e22-4457-bf61-5614cc4396ad@tu-bs.de
> 
> Hopefully I (or somebody from Arm) can provide an update soon on this
> topic but I wouldn't necessarily expect it to help you with this new
> case.
> 
> Will

We also have another similar issue:
• Watchdog detected hard LOCKUP on cpu 18
• pc : queued_spin_lock_slowpath+0x2ac

It was stuck in a loop at ：
     next = smp_cond_load_relaxed(&node->next, (VAL));

The read mcs node->next is 0:
crash> p qnodes:18
per_cpu(qnodes, 18) = $1 =
  {{
     mcs = {
       next = 0x0,
       locked = 0,
       count = 1
     }
   },

This seems consistent with the phenomenon described in (6b), but we
found that lock->val read by the following code was incorrectly cleared:

  val = atomic_cond_read_acquire(&lock->val, !(VAL &
_Q_LOCKED_PENDING_MASK));

According to the assembly code, lock->val is loaded into w5:
   ldar    w5, [x19]
   mov     w0, w5
   and     w1, w5, #0xffff
The value of register x5 is 0:
   x5 : 0000000000000000

If it were just reordering in xchg_tail, I don't think tail would be
erroneously cleared. Might there be other possibilities for this issue?

-- 
Thanks,
Yu Peng