Message-ID: <20250613075501.GI2273038@noisy.programming.kicks-ass.net>
Date: Fri, 13 Jun 2025 09:55:01 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Thomas Haas <t.haas@...bs.de>
Cc: Alan Stern <stern@...land.harvard.edu>,
Andrea Parri <parri.andrea@...il.com>,
Will Deacon <will@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
Nicholas Piggin <npiggin@...il.com>,
David Howells <dhowells@...hat.com>,
Jade Alglave <j.alglave@....ac.uk>,
Luc Maranget <luc.maranget@...ia.fr>,
"Paul E. McKenney" <paulmck@...nel.org>,
Akira Yokosawa <akiyks@...il.com>,
Daniel Lustig <dlustig@...dia.com>,
Joel Fernandes <joelagnelf@...dia.com>,
linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
lkmm@...ts.linux.dev, hernan.poncedeleon@...weicloud.com,
jonas.oberhauser@...weicloud.com,
"r.maseli@...bs.de" <r.maseli@...bs.de>
Subject: Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
> We have been looking into whether mixed-size accesses (MSA) can affect the
> correctness of qspinlock.
> We are focusing on aarch64, which is the only memory model with MSA support
> [1].
> For this we extended the dartagnan [2] tool to support MSA and now it
> reports liveness, synchronization, and mutex issues.
> Notice that we did something similar in the past for LKMM, but we were
> ignoring MSA [3].
>
> The culprit of all these issues is that the atomicity of a single load
> instruction is not guaranteed in the presence of smaller-sized stores
> (observed on real hardware according to [1], Fig. 21/22).
> Consider the following pseudo code:
>
> int16 old = xchg16_rlx(&lock, 42);
> int32 l = load32_acq(&lock);
>
> Then the hardware can treat the code as (likely due to store-forwarding)
>
> int16 old = xchg16_rlx(&lock, 42);
> int16 l1 = load16_acq(&lock);
> int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer arithmetic
>
> and reorder it to
>
> int16 l2 = load16_acq(&lock + 2);
> int16 old = xchg16_rlx(&lock, 42);
> int16 l1 = load16_acq(&lock);
>
> Now another thread can overwrite "lock" in between the first two accesses, so
> that the original l (i.e., l1 combined with l2) ends up containing
> parts of a lock value that is older than the one the xchg observed.
Oops :-(
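
For illustration, a rough sketch of the race described above, in the same
pseudo-code style (NEWVAL is just a placeholder for whatever the other
thread writes; little-endian layout assumed):

	int16 l2 = load16_acq(&lock + 2);  // reads the stale upper half
		// <-- other thread: store32_rlx(&lock, NEWVAL);
	int16 old = xchg16_rlx(&lock, 42); // observes NEWVAL
	int16 l1 = load16_acq(&lock);      // reads the current lower half

	// The reconstructed l = (l2 << 16) | l1 contains, in its upper
	// half, part of a lock value that is older than the one the
	// xchg observed.
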
(snip the excellent details)
> ### Solutions
>
> The problematic executions rely on the fact that T2 can move half of its
> load operation (1) to before the xchg_tail (3).
> Preventing this reordering solves all issues. Possible solutions are:
> - make the xchg_tail full-sized (i.e., also touch the lock/pending bits).
> Note that if the kernel is configured with >= 16k cpus, then the tail
> becomes larger than 16 bits and needs to be encoded in parts of the pending
> byte as well.
> In this case, the kernel makes a full-sized (32-bit) access for the
> xchg. So the above bugs are only present in the < 16k cpus setting.
Right, but that is the more expensive option for some.
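
(For context, the full-width form is roughly a cmpxchg loop over the whole
32-bit lock word; the sketch below is from memory of the >= 16k-CPUs
variant, not a verbatim copy:)

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	u32 old, new;

	old = atomic_read(&lock->val);
	do {
		/* Preserve the locked/pending bits, replace only the tail. */
		new = (old & _Q_LOCKED_PENDING_MASK) | tail;
	} while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));

	return old;
}
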
> - make the xchg_tail an acquire operation.
> - make the xchg_tail a release operation (this is an odd solution by
> itself but works for aarch64 because it preserves REL->ACQ ordering). In
> this case, maybe the preceding "smp_wmb()" can be removed.
I think I prefer this one; it moves a barrier rather than adding
additional overhead. Will?
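
For reference, a minimal sketch of that variant, assuming the current
< 16k-CPUs xchg_tail() is a relaxed 16-bit xchg on lock->tail (sketch
only, not a tested patch):

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	/*
	 * Sketch: release semantics on the tail exchange, so that (per
	 * the report) half of the subsequent acquire load of the lock
	 * word cannot be reordered before this xchg on arm64, relying
	 * on REL->ACQ ordering.
	 */
	return (u32)xchg_release(&lock->tail,
				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}
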
> - put some other read-read barrier between the xchg_tail and the load.
>
>
> ### Implications for qspinlock executed on non-ARM architectures.
>
> Unfortunately, there are no MSA extensions for other hardware memory models,
> so we have to speculate based on whether the problematic reordering would be
> permitted if the problematic load were treated as two individual
> instructions.
> It seems Power and RISCV would have no problem reordering the instructions,
> so qspinlock might also break on those architectures.
Power (and RISC-V without ZABHA) 'emulates' the short XCHG using a full-word
LL/SC and should be good.
But yes, ZABHA might be equally broken.
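
For illustration, a rough sketch (not the actual arch code) of how a
16-bit xchg gets emulated on top of a full-word LL/SC; ll32()/sc32() and
xchg16_emulated() are hypothetical stand-ins for the load-reserved/
store-conditional primitives, little-endian layout assumed:

static inline u16 xchg16_emulated(u16 *ptr, u16 val)
{
	u32 *word = (u32 *)((unsigned long)ptr & ~3UL); /* containing 32-bit word */
	u32 shift = ((unsigned long)ptr & 2) * 8;
	u32 mask  = 0xffffU << shift;
	u32 old, new;

	do {
		old = ll32(word);                       /* reserve the whole word */
		new = (old & ~mask) | ((u32)val << shift);
	} while (!sc32(word, new));                     /* retry if the word changed */

	return (u16)(old >> shift);
}

The reservation covers the whole 32-bit lock word even though only 16 bits
change, so the exchange behaves like a full-word atomic access.
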
> TSO, on the other hand, does not permit such reordering. Also, the xchg_tail
> is an RMW operation, which acts like a full memory barrier under TSO, so even
> if load-load reordering were permitted, the RMW would prevent this.
Right.