linux-kernel - Re: [RFC PATCH v2 0/1] arch: arm64: Implement unaligned atomic emulation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aV1YnOetDHhKe4hz@willie-the-truck>
Date: Tue, 6 Jan 2026 18:46:52 +0000
From: Will Deacon <will@...nel.org>
To: André Almeida <andrealmeid@...lia.com>
Cc: Catalin Marinas <catalin.marinas@....com>,
	Mark Rutland <mark.rutland@....com>,
	Mark Brown <broonie@...nel.org>,
	linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
	kernel-dev@...lia.com, Ryan Houdek <Sonicadvance1@...il.com>,
	Billy Laws <blaws05@...il.com>
Subject: Re: [RFC PATCH v2 0/1] arch: arm64: Implement unaligned atomic
 emulation

Hi Andre,

On Mon, Nov 17, 2025 at 01:08:40PM -0300, André Almeida wrote:
> This patch proposes adding kernel-side emulation for unaligned atomic
> instructions on ARM64. This is intended for x86 emulators (like FEX)
> that struggle to effectively handle such operations in userspace alone.
> Such handling is required as x86 permits such unaligned accesses (albeit
> sometimes with a performance penalty as in the case of split-locks[1])
> but ARM64 does not and will raise a bus error.

As discussed at LPC, I'm not thrilled with the idea of emulating these
historical warts of x86 on arm64 and having to carry that code forward
forever. The architecture provides a bunch of helpful instructions for
emulating x86 code but for esoteric things like split-lock atomics, I
think we need to draw a line. After all, you can still buy an x86
machine if you want one and they tend to be cheaper and more reliable ;)

> User applications that wish to enable support for this can use the new
> pctrl() flag `PR_ARM64_UNALIGN_ATOMIC_EMULATE`. Some optimizations and
> instructions were left for future revisions of this patchset.
> 
> Emulators like FEX attempt to emulate this in userspace, but with
> caveats in two areas:
> 
>  * Performance
> 
> It should first be noted that due to x86's TSO (total store order)
> memory model, ARM64 atomic instructions must be used for all memory
> accesses.

Just a nit on terminology because it's getting in the way a bit here and
I'm genuinely unsure as to what you're saying. The Arm architecture uses
"atomic instructions" to refer to read-modify-write instructions such as
CAS and SWP. You presumably don't need to use those for everything;
rather I assume you're using LDAPR for plain loads and STLR for plain
stores?

> This results in unaligned loads/stores being much more common
> than one would expect and the overhead of emulating them significantly
> impacting performance.

FEAT_LSE2 should solve this for LDAPR and STLR in the vast majority of
cases, no?

> For this common case of unaligned loads/stores, code backpatching is used
> in FEX to avoid repeated overhead from handling the same faulting access.
> This replaces faulting unaligned atomic ARM64 instructions with regular
> load/stores and memory barriers.

I'm assuming this is only the case on systems without FEAT_LSE2?

> This comes at a cost of introducing significant performance problems if a
> function like memcpy ends up being patched because it very infrequently
> happens to be used with unaligned memory. This is severe enough to make
> games like Mirror's Edge and Assassin's Creed: Origin unplayable without
> application-specific configuration.
> 
> Microbenchmarks[2] measure a 4x decrease in overhead with kernel-side
> handling compared to userspace, and this figure is currently even larger
> when FEX is ran under Wine. Such a dramatic decrease would make it
> reasonable for FEX to default to the no-backpatching path and provide
> consistent performance. 

I wonder if you can defer backpatching until you've got some idea about
the likelihood of the access being misaligned? Or is that what you mean
by "application-specific configuration"?

>  * Correctness:
> 
> x86 atomic accesses can cross 16-byte (LSE2) granules, but there is no
> ARM64 instruction that would allow for direct atomic emulation of this.

I'm assuming these are relatively rare and apply only to locked
instructions so don't, for example, occur in memcpy() as you mentioned
above?

> As such, a lock must be taken for correctness. While this is easy to
> emulate in userspace within a process, correct atomic cross-granule
> operations on shared memory mapped into multiple processes would require
> a lock shared between all FEX instances which cannot be implemented
> safely in userspace as is (robust futexes do not work here as they are
> under the control of the emulated x86 program). Note, this is a less
> coarse restriction than split locks on x86, which are only concerned
> with accesses crossing a 64 byte cacheline size.

Even in the kernel, I'm having a hard time to convince myself that a
global lock gives you correctness here because you're _not_ serialising
with concurrent aligned accesses that could still be conflicting with
the faulting accesses if mixed-size concurrency is in use.

I've struggled to come up with an example that goes wrong, but if we
imagine a 64-bit SWP that straddles a 16-byte boundary and is trying to
set each 32-bit half of a zero-initialised location to 2:

	SWP	(2, 2)

This will trap, but another CPU could concurrently execute a pair of
ordered, aligned 32-bit SWPs on each half, which will not trap:

	SWP	(1, _)	// Set the upper 32-bit word to 1
	DMB	ISH
	SWP	(_, 1)	// Set the lower 32-bit word to 1

My reading of the code is that:

  1. The 64-bit SWP reads upper = 0, lower = 0
  2. The 64-bit SWP writes upper to 2
  3. The first 32-bit SWP executes, setting upper to 1 and returning 2
  4. The second 32-bit SWP executes, setting lower to 1 and returning 0
  5. The 64-bit SWP fails to write lower as it is no longer zero and so
     returns (0, 1)

In which case, the value in memory is (1, 1) but the old value of (0, 1)
returned by the 64-bit SWP doesn't make sense in that case.

What did I miss?

I also notice that your patch doesn't implement support for CAS and I
worry that letting out partial updates from supposedly atomic stores
could cause additional issues there. A CAS that reports failure could
actually have mutated memory, which then feeds into the next iteration
of the CAS loop in userspace.

Anyway, my lack of understanding aside, this doesn't look like an ABI we
should be committing to on arm64.

Cheers,

Will