Message-ID: <20250616142347.GA17781@willie-the-truck>
Date: Mon, 16 Jun 2025 15:23:48 +0100
From: Will Deacon <will@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Thomas Haas <t.haas@...bs.de>, Alan Stern <stern@...land.harvard.edu>,
	Andrea Parri <parri.andrea@...il.com>,
	Boqun Feng <boqun.feng@...il.com>,
	Nicholas Piggin <npiggin@...il.com>,
	David Howells <dhowells@...hat.com>,
	Jade Alglave <j.alglave@....ac.uk>,
	Luc Maranget <luc.maranget@...ia.fr>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Akira Yokosawa <akiyks@...il.com>,
	Daniel Lustig <dlustig@...dia.com>,
	Joel Fernandes <joelagnelf@...dia.com>,
	linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
	lkmm@...ts.linux.dev, hernan.poncedeleon@...weicloud.com,
	jonas.oberhauser@...weicloud.com,
	"r.maseli@...bs.de" <r.maseli@...bs.de>
Subject: Re: [RFC] Potential problem in qspinlock due to mixed-size accesses

On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
> > We have been looking at whether mixed-size accesses (MSA) can affect the
> > correctness of qspinlock.
> > We are focusing on aarch64, which is the only memory model with MSA
> > support [1].
> > For this we extended the dartagnan [2] tool to support MSA and now it
> > reports liveness, synchronization, and mutex issues.
> > Notice that we did something similar in the past for LKMM, but we were
> > ignoring MSA [3].
> > 
> > The culprit of all these issues is that the atomicity of a single load
> > instruction is not guaranteed in the presence of smaller-sized stores
> > (observed on real hardware according to [1], Fig. 21/22).
> > Consider the following pseudo code:
> > 
> >     int16 old = xchg16_rlx(&lock, 42);
> >     int32 l = load32_acq(&lock);
> > 
> > Then the hardware can treat the code as (likely due to store-forwarding)
> > 
> >     int16 old = xchg16_rlx(&lock, 42);
> >     int16 l1 = load16_acq(&lock);
> >     // Assuming byte-precise pointer arithmetic:
> >     int16 l2 = load16_acq(&lock + 2);
> > 
> > and reorder it to
> > 
> >     int16 l2 = load16_acq(&lock + 2);
> >     int16 old = xchg16_rlx(&lock, 42);
> >     int16 l1 = load16_acq(&lock);
> > 
> > Now another thread can overwrite "lock" in between the first two accesses,
> > so that the original l (i.e. l1 and l2) ends up containing parts of a lock
> > value that is older than what the xchg observed.
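
To spell out one interleaving that produces this (my own sketch; "Tx" and
"store32" are just illustrative pseudocode for some other thread doing a
full-width store to "lock"):

    T2: int16 l2  = load16_acq(&lock + 2);  // hoisted: reads the old upper half
    Tx:            store32(&lock, NEW);     // overwrites both halves
    T2: int16 old = xchg16_rlx(&lock, 42);  // observes NEW's lower half
    T2: int16 l1  = load16_acq(&lock);      // reads the fresh lower half (42)

so the reconstructed 32-bit l pairs a lower half at least as new as what the
xchg observed with an upper half that is older than it.
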
> 
> Oops :-(
> 
> (snip the excellent details)
> 
> > ### Solutions
> > 
> > The problematic executions rely on the fact that T2 can move half of its
> > load operation (1) to before the xchg_tail (3).
> > Preventing this reordering solves all issues. Possible solutions are:
> >     - make the xchg_tail full-sized (i.e., also touch the lock/pending
> >       bits).
> >       Note that if the kernel is configured with >= 16k cpus, then the
> >       tail becomes larger than 16 bits and needs to be encoded in parts
> >       of the pending byte as well.
> >       In this case, the kernel makes a full-sized (32-bit) access for
> >       the xchg, so the above bugs are only present in the < 16k cpus
> >       setting.
> 
> Right, but that is the more expensive option for some.
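
For reference, the >= 16k CPUs configuration already does the full-width
RMW here; from memory (untested sketch, may not match current upstream
exactly), its xchg_tail() is roughly the cmpxchg loop below, which is why
the report only applies to the < 16k case:

    /* _Q_PENDING_BITS != 8: update the tail with a full-width RMW */
    static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
    {
            u32 old, new, val = atomic_read(&lock->val);

            for (;;) {
                    /* keep the locked/pending bits, swap in the new tail */
                    new = (val & _Q_LOCKED_PENDING_MASK) | tail;
                    old = atomic_cmpxchg_relaxed(&lock->val, val, new);
                    if (old == val)
                            break;
                    val = old;
            }
            return old;
    }
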
> 
> >     - make the xchg_tail an acquire operation.
> >     - make the xchg_tail a release operation (this is an odd solution by
> >       itself but works for aarch64 because it preserves REL->ACQ
> >       ordering). In this case, maybe the preceding "smp_wmb()" can be
> >       removed.
> 
> I think I prefer this one; it moves a barrier, not really adding
> additional overhead. Will?
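
For concreteness, the < 16k CPUs xchg_tail() currently does a 16-bit
xchg_relaxed() on lock->tail, so the release variant would be roughly the
one-line change below (untested sketch; whether the preceding smp_wmb()
can then go is a separate question, as Thomas notes):

    static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
    {
            /*
             * Use release instead of relaxed semantics: on arm64 this
             * keeps the later acquire load ordered after this RMW
             * (REL->ACQ ordering).
             */
            return (u32)xchg_release(&lock->tail,
                                     tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
    }
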

I'm half inclined to think that the Arm memory model should be tightened
here; I can raise that with Arm and see what they say.

Although the cited paper does give examples of store-forwarding from a
narrow store to a wider load, the case in qspinlock is further
constrained by having the store come from an atomic rmw and the load
having acquire semantics. Setting aside the MSA part, that specific case
_is_ ordered in the Arm memory model (and C++ release sequences rely on
it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
an atomic rmw to an acquire load.

Given that, I don't see how this is going to occur in practice.

Will
