Message-ID: <d66d0351-b523-40da-ae47-8b06f37bf3b6@tu-bs.de>
Date: Tue, 17 Jun 2025 08:19:19 +0200
From: Thomas Haas <t.haas@...bs.de>
To: Will Deacon <will@...nel.org>, Peter Zijlstra <peterz@...radead.org>
CC: Alan Stern <stern@...land.harvard.edu>, Andrea Parri
	<parri.andrea@...il.com>, Boqun Feng <boqun.feng@...il.com>, Nicholas Piggin
	<npiggin@...il.com>, David Howells <dhowells@...hat.com>, Jade Alglave
	<j.alglave@....ac.uk>, Luc Maranget <luc.maranget@...ia.fr>, "Paul E.
 McKenney" <paulmck@...nel.org>, Akira Yokosawa <akiyks@...il.com>, Daniel
 Lustig <dlustig@...dia.com>, Joel Fernandes <joelagnelf@...dia.com>,
	<linux-kernel@...r.kernel.org>, <linux-arch@...r.kernel.org>,
	<lkmm@...ts.linux.dev>, <hernan.poncedeleon@...weicloud.com>,
	<jonas.oberhauser@...weicloud.com>, "r.maseli@...bs.de" <r.maseli@...bs.de>
Subject: Re: [RFC] Potential problem in qspinlock due to mixed-size accesses



On 16.06.25 16:23, Will Deacon wrote:
> On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
>> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
>>> We have been looking at whether mixed-size accesses (MSA) can affect
>>> the correctness of qspinlock.
>>> We are focusing on aarch64, which is the only memory model with MSA
>>> support [1].
>>> For this, we extended the dartagnan [2] tool to support MSA, and it now
>>> reports liveness, synchronization, and mutual-exclusion issues.
>>> Note that we did something similar in the past for LKMM, but ignoring
>>> MSA [3].
>>>
>>> The culprit behind all of these issues is that the atomicity of a single
>>> load instruction is not guaranteed in the presence of smaller-sized
>>> stores (observed on real hardware according to [1], Fig. 21/22).
>>> Consider the following pseudo code:
>>>
>>>      int16 old = xchg16_rlx(&lock, 42);
>>>      int32 l = load32_acq(&lock);
>>>
>>> Then the hardware can treat the code as (likely due to store-forwarding)
>>>
>>>      int16 old = xchg16_rlx(&lock, 42);
>>>      int16 l1 = load16_acq(&lock);
>>>      int16 l2 = load16_acq(&lock + 2); // assuming byte-precise pointer arithmetic
>>>
>>> and reorder it to
>>>
>>>      int16 l2 = load16_acq(&lock + 2);
>>>      int16 old = xchg16_rlx(&lock, 42);
>>>      int16 l1 = load16_acq(&lock);
>>>
>>> Now another thread can overwrite "lock" between the first two accesses,
>>> so that the original l (recombined from l1 and l2) ends up containing
>>> parts of a lock value that is older than what the xchg observed.
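>>>
>>> For illustration, one bad interleaving might look as follows (a sketch
>>> in the same pseudo code; T1 is the overwriting thread, and NEW stands
>>> for whatever 32-bit value it stores):
>>>
>>>      T2: int16 l2 = load16_acq(&lock + 2);  // upper half: pre-NEW value
>>>      T1: store32_rlx(&lock, NEW);           // overwrites the whole word
>>>      T2: int16 old = xchg16_rlx(&lock, 42); // observes NEW
>>>      T2: int16 l1 = load16_acq(&lock);      // lower half: post-xchg value
>>>
>>> The recombined l then mixes an upper half that is older than what the
>>> xchg observed with a fresh lower half.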
>>
>> Oops :-(
>>
>> (snip the excellent details)
>>
>>> ### Solutions
>>>
>>> The problematic executions rely on the fact that T2 can move half of its
>>> load operation (1) to before the xchg_tail (3).
>>> Preventing this reordering solves all issues. Possible solutions are:
>>>      - make the xchg_tail full-sized (i.e., also touch the lock/pending
>>>        bits). Note that if the kernel is configured for >= 16k CPUs, the
>>>        tail becomes larger than 16 bits and needs to be encoded in parts
>>>        of the pending byte as well. In that case, the kernel already makes
>>>        a full-sized (32-bit) access for the xchg, so the above bugs are
>>>        only present in the < 16k CPUs setting.
>>
>> Right, but that is the more expensive option for some.
>>
>>>      - make the xchg_tail an acquire operation.
>>>      - make the xchg_tail a release operation (an odd solution by itself,
>>>        but it works for aarch64 because REL->ACQ ordering is preserved;
>>>        see the sketch below). In this case, the preceding "smp_wmb()" can
>>>        maybe be removed.
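>>>
>>> As a sketch of the release option (assuming the < 16k CPUs variant of
>>> xchg_tail in kernel/locking/qspinlock.c; the only change is the
>>> ordering of the xchg, not its size):
>>>
>>>      static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>>>      {
>>>              /*
>>>               * RELEASE instead of RELAXED: on aarch64 this keeps the
>>>               * tail store ordered before later acquire loads of the
>>>               * lock word, ruling out the reordering described above.
>>>               */
>>>              return (u32)xchg_release(&lock->tail,
>>>                                       tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
>>>      }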
>>
>> I think I prefer this one; it moves a barrier rather than adding
>> additional overhead. Will?
> 
> I'm half inclined to think that the Arm memory model should be tightened
> here; I can raise that with Arm and see what they say.
> 
> Although the cited paper does give examples of store-forwarding from a
> narrow store to a wider load, the case in qspinlock is further
> constrained by having the store come from an atomic rmw and the load
> having acquire semantics. Setting aside the MSA part, that specific case
> _is_ ordered in the Arm memory model (and C++ release sequences rely on
> it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
> an atomic rmw to an acquire load.
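> 
> For illustration, the same-size analogue of that constraint in the
> earlier pseudo code would be (a sketch of the ordering rule, not of any
> real qspinlock code):
> 
>      int32 old = xchg32_rlx(&lock, 42); // store half of an atomic rmw
>      int32 l = load32_acq(&lock);       // acquire load: may not be
>                                         // satisfied by forwarding from
>                                         // the rmw's store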
> 
> Given that, I don't see how this is going to occur in practice.
> 
> Will

You are probably right. The ARM model's atomic-ordered-before relation

      let aob = rmw | [range(rmw)]; lrs; [A | Q]

clearly orders the rmw's store before subsequent acquire loads (lrs =
local-read-successor, A = acquire, Q = acquirePC).
If we treat this relation (at least the second part) as a "global 
ordering" and extend it by "si" (same-instruction), then the problematic 
reordering under MSA should be gone.
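
Concretely, the variant I tried looks roughly like this (a sketch; the
exact formulation in the official cat file may differ, and "si" is the
same-instruction equivalence from the mixed-size model):

      let aob = rmw | [range(rmw)]; lrs; si; [A | Q]

The extra "si" step extends the ordering from the rmw-store to the
other half of a split acquire load.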
I quickly ran Dartagnan on the MSA litmus tests with this change to the 
ARM model and all the tests still pass.

We should definitely ask ARM about this. I did send an email to Jade
before writing about this issue here, but she was (and still is) busy
and told me to ask at memory-model@....com.
I will ask them.


-- 
=====================================

Thomas Haas

Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany

t.haas@...braunschweig.de
https://www.tu-braunschweig.de/tcs/team/thomas-haas

