Message-ID: <2b11f09d-b938-4a8e-8c3a-c39b6fea2b21@huaweicloud.com>
Date: Tue, 17 Jun 2025 10:42:04 +0200
From: Hernan Ponce de Leon <hernan.poncedeleon@...weicloud.com>
To: Thomas Haas <t.haas@...bs.de>, Will Deacon <will@...nel.org>,
 Peter Zijlstra <peterz@...radead.org>
Cc: Alan Stern <stern@...land.harvard.edu>,
 Andrea Parri <parri.andrea@...il.com>, Boqun Feng <boqun.feng@...il.com>,
 Nicholas Piggin <npiggin@...il.com>, David Howells <dhowells@...hat.com>,
 Jade Alglave <j.alglave@....ac.uk>, Luc Maranget <luc.maranget@...ia.fr>,
 "Paul E. McKenney" <paulmck@...nel.org>, Akira Yokosawa <akiyks@...il.com>,
 Daniel Lustig <dlustig@...dia.com>, Joel Fernandes <joelagnelf@...dia.com>,
 linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
 lkmm@...ts.linux.dev, jonas.oberhauser@...weicloud.com,
 "r.maseli@...bs.de" <r.maseli@...bs.de>
Subject: Re: [RFC] Potential problem in qspinlock due to mixed-size accesses

On 6/17/2025 8:19 AM, Thomas Haas wrote:
> 
> 
> On 16.06.25 16:23, Will Deacon wrote:
>> On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
>>> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
>>>> We have been taking a look at whether mixed-size accesses (MSA) can
>>>> affect the correctness of qspinlock.
>>>> We are focusing on aarch64, which is the only memory model with MSA
>>>> support [1].
>>>> For this we extended the dartagnan [2] tool to support MSA and now it
>>>> reports liveness, synchronization, and mutex issues.
>>>> Notice that we did something similar in the past for LKMM, but we were
>>>> ignoring MSA [3].
>>>>
>>>> The culprit of all these issues is that atomicity of single load
>>>> instructions is not guaranteed in the presence of smaller-sized stores
>>>> (observed on real hardware according to [1], Fig. 21/22).
>>>> Consider the following pseudocode:
>>>>
>>>>      int16 old = xchg16_rlx(&lock, 42);
>>>>      int32 l = load32_acq(&lock);
>>>>
>>>> Then the hardware can treat the code as (likely due to
>>>> store-forwarding)
>>>>
>>>>      int16 old = xchg16_rlx(&lock, 42);
>>>>      int16 l1 = load16_acq(&lock);
>>>>      int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer arithmetic
>>>>
>>>> and reorder it to
>>>>
>>>>      int16 l2 = load16_acq(&lock + 2);
>>>>      int16 old = xchg16_rlx(&lock, 42);
>>>>      int16 l1 = load16_acq(&lock);
>>>>
>>>> Now another thread can overwrite "lock" in between the first two
>>>> accesses, so that the original l (l1 and l2) ends up containing
>>>> parts of a lock value that is older than what the xchg observed.
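>>>>
>>>> For reference, the lock-word layout involved here is roughly the
>>>> following (a sketch of the little-endian, < 16k cpus configuration;
>>>> field names follow the kernel's qspinlock_types.h, exact details may
>>>> differ across versions):
>>>>
>>>>      struct qspinlock {                 // 32-bit lock word
>>>>          union {
>>>>              atomic_t val;              // full 32-bit accesses, e.g. the acquire load
>>>>              struct {
>>>>                  u8  locked;            // byte 0
>>>>                  u8  pending;           // byte 1
>>>>              };
>>>>              struct {
>>>>                  u16 locked_pending;    // bytes 0-1
>>>>                  u16 tail;              // bytes 2-3, target of the 16-bit xchg
>>>>              };
>>>>          };
>>>>      };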
>>>
>>> Oops :-(
>>>
>>> (snip the excellent details)
>>>
>>>> ### Solutions
>>>>
>>>> The problematic executions rely on the fact that T2 can move half of
>>>> its load operation (1) to before the xchg_tail (3).
>>>> Preventing this reordering solves all issues. Possible solutions are:
>>>>      - make the xchg_tail full-sized (i.e., also touch the lock/pending
>>>>        bits).
>>>>        Note that if the kernel is configured with >= 16k cpus, then the
>>>>        tail becomes larger than 16 bits and needs to be encoded in parts
>>>>        of the pending byte as well.
>>>>        In this case, the kernel already makes a full-sized (32-bit)
>>>>        access for the xchg, so the above bugs are only present in the
>>>>        < 16k cpus setting.
>>>
>>> Right, but that is the more expensive option for some.
>>>
>>>>      - make the xchg_tail an acquire operation.
>>>>      - make the xchg_tail a release operation (this is an odd solution
>>>>        by itself, but works for aarch64 because it preserves REL->ACQ
>>>>        ordering). In this case, maybe the preceding "smp_wmb()" can be
>>>>        removed.
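>>>>
>>>> For illustration, the acquire variant would be roughly the following
>>>> change to xchg_tail() in kernel/locking/qspinlock.c (< 16k cpus case;
>>>> untested sketch, details may differ across kernel versions):
>>>>
>>>>      static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>>>>      {
>>>>          /* was xchg_relaxed(); acquire keeps the later load of the
>>>>             full lock word from being reordered before this RMW */
>>>>          return (u32)xchg_acquire(&lock->tail,
>>>>                                   tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
>>>>      }
>>>>
>>>> The release variant would instead use xchg_release() in the same spot.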
>>>
>>> I think I prefer this one; it moves a barrier, not really adding
>>> additional overhead. Will?
>>
>> I'm half inclined to think that the Arm memory model should be tightened
>> here; I can raise that with Arm and see what they say.
>>
>> Although the cited paper does give examples of store-forwarding from a
>> narrow store to a wider load, the case in qspinlock is further
>> constrained by having the store come from an atomic rmw and the load
>> having acquire semantics. Setting aside the MSA part, that specific case
>> _is_ ordered in the Arm memory model (and C++ release sequences rely on
>> it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
>> an atomic rmw to an acquire load.
>>
>> Given that, I don't see how this is going to occur in practice.
>>
>> Will
> 
> You are probably right. The ARM model's atomic-ordered-before relation
> 
>       let aob = rmw | [range(rmw)]; lrs; [A | Q]
> 
> clearly orders the rmw-store with subsequent acquire loads
> (lrs = local-read-successor, A = acquire).
> If we treat this relation (at least the second part) as a "global
> ordering" and extend it by "si" (same-instruction), then the problematic
> reordering under MSA should be gone.
> I quickly ran Dartagnan on the MSA litmus tests with this change to the
> ARM model and all the tests still pass.

Even with this change I still get violations (both safety and
termination) for qspinlock with dartagnan.

I think the problem is actually with the Internal visibility axiom,
because strengthening only that one seems to remove the violations.

Hernan

> 
> We should definitely ask ARM about this. I did send an email to Jade
> before writing about this issue here, but she was (and still is) busy
> and told me to ask at memory-model@....com.
> I will ask them.
> 
> 

