Message-ID: <CABnRqDf5EQUoXu=pJ6mj4-JfwAzEfcAE2cYrNzJANFycx7cMUA@mail.gmail.com>
Date: Fri, 16 Jan 2026 13:58:18 -0800
From: Ryan Houdek <sonicadvance1@...il.com>
To: Will Deacon <will@...nel.org>
Cc: André Almeida <andrealmeid@...lia.com>,
Catalin Marinas <catalin.marinas@....com>, Mark Rutland <mark.rutland@....com>,
Mark Brown <broonie@...nel.org>, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, kernel-dev@...lia.com,
Billy Laws <blaws05@...il.com>
Subject: Re: [RFC PATCH v2 0/1] arch: arm64: Implement unaligned atomic emulation
Hey there,
I'll try and explain some of the requirements here from the FEX side.
On Tue, Jan 6, 2026 at 10:46 AM Will Deacon <will@...nel.org> wrote:
>
> Hi Andre,
>
> On Mon, Nov 17, 2025 at 01:08:40PM -0300, André Almeida wrote:
> > This patch proposes adding kernel-side emulation for unaligned atomic
> > instructions on ARM64. This is intended for x86 emulators (like FEX)
> > that struggle to effectively handle such operations in userspace alone.
> > Such handling is required as x86 permits such unaligned accesses (albeit
> > sometimes with a performance penalty as in the case of split-locks[1])
> > but ARM64 does not and will raise a bus error.
>
> As discussed at LPC, I'm not thrilled with the idea of emulating these
> historical warts of x86 on arm64 and having to carry that code forward
> forever. The architecture provides a bunch of helpful instructions for
> emulating x86 code but for esoteric things like split-lock atomics, I
> think we need to draw a line. After all, you can still buy an x86
> machine if you want one and they tend to be cheaper and more reliable ;)
Indeed.
>
> > User applications that wish to enable support for this can use the new
> > prctl() flag `PR_ARM64_UNALIGN_ATOMIC_EMULATE`. Some optimizations and
> > instructions were left for future revisions of this patchset.
> >
> > Emulators like FEX attempt to emulate this in userspace, but with
> > caveats in two areas:
> >
> > * Performance
> >
> > It should first be noted that due to x86's TSO (total store order)
> > memory model, ARM64 atomic instructions must be used for all memory
> > accesses.
>
> Just a nit on terminology because it's getting in the way a bit here and
> I'm genuinely unsure as to what you're saying. The Arm architecture uses
> "atomic instructions" to refer to read-modify-write instructions such as
> CAS and SWP. You presumably don't need to use those for everything;
> rather I assume you're using LDAPR for plain loads and STLR for plain
> stores?
Semantics aside, yes: FEX uses all of the LRCPC1/2/3/4 instructions,
plus the atomic memory instructions, to emulate the x86 memory model,
falling back all the way to ARMv8.0 instructions as required.
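To make that concrete, a plain x86 load/store gets lowered to roughly
the following when FEAT_LRCPC is available (a sketch with made-up
helper names, not FEX's actual code; assumes a compiler targeting
armv8.3-a or later - on ARMv8.0 you fall back to LDAR/STLR or plain
accesses plus DMBs):

#include <stdint.h>

static inline uint64_t tso_load64(const uint64_t *p)
{
	uint64_t v;

	/* FEAT_LRCPC load-acquire (RCpc) covers x86 load ordering */
	asm volatile("ldapr %0, %1" : "=r"(v) : "Q"(*p) : "memory");
	return v;
}

static inline void tso_store64(uint64_t *p, uint64_t v)
{
	/* store-release covers x86 store->store and load->store order */
	asm volatile("stlr %1, %0" : "=Q"(*p) : "r"(v) : "memory");
}

Every guest memory access goes through something like this, which is
why unaligned accesses end up far more common than you'd expect.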
>
> > This results in unaligned loads/stores being much more common
> > than one would expect and the overhead of emulating them significantly
> > impacting performance.
>
> FEAT_LSE2 should solve this for LDAPR and STLR in the vast majority of
> cases, no?
This is a very sweeping statement. FEAT_LSE2 only covers around 12% of
our unaligned memory accesses: because it only relaxes alignment within
a 16-byte granule, it doesn't help with the vast majority of them. It
also means we end up with "split-lock" style emulation at both
cache-line granularity AND 16B granularity, unlike x86, which only
cares about cache lines. So the overhead even on systems with LSE2 is
unacceptable.
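For reference, the check in play is just a granule-crossing test like
this (my own helper, purely to illustrate the two granule sizes):

#include <stdbool.h>
#include <stddef.h>

/*
 * An access of 'size' bytes starting at 'addr' needs special handling
 * if its first and last byte land in different granules: 16 bytes for
 * the LSE2 relaxation, 64 bytes for the x86 split-lock case.
 */
static inline bool crosses_granule(unsigned long addr, size_t size,
				   unsigned long granule)
{
	return (addr / granule) != ((addr + size - 1) / granule);
}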
>
> > For this common case of unaligned loads/stores, code backpatching is used
> > in FEX to avoid repeated overhead from handling the same faulting access.
> > This replaces faulting unaligned atomic ARM64 instructions with regular
> > load/stores and memory barriers.
>
> I'm assuming this is only the case on systems without FEAT_LSE2?
We are required to backpatch even on LSE2 systems; as previously
stated, LSE2 isn't good enough.
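For illustration, the backpatched form of a store looks something like
this (my own sketch of the idea, not FEX's actual codegen; I'm
hand-waving the exact barrier placement):

#include <stdint.h>

static inline void tso_store64_patched(uint64_t *p, uint64_t v)
{
	/*
	 * Original instruction: stlr %1, %0 - faults when the address
	 * is unaligned beyond what LSE2 allows. The backpatched
	 * replacement is a plain store bracketed by barriers, keeping
	 * the ordering but giving up single-copy atomicity.
	 */
	asm volatile("dmb ish\n\t"
		     "str %1, %0\n\t"
		     "dmb ish"
		     : "=m"(*p)
		     : "r"(v)
		     : "memory");
}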
>
> > This comes at a cost of introducing significant performance problems if a
> > function like memcpy ends up being patched because it very infrequently
> > happens to be used with unaligned memory. This is severe enough to make
> > games like Mirror's Edge and Assassin's Creed: Origin unplayable without
> > application-specific configuration.
> >
> > Microbenchmarks[2] measure a 4x decrease in overhead with kernel-side
> > handling compared to userspace, and this figure is currently even larger
> > when FEX is run under Wine. Such a dramatic decrease would make it
> > reasonable for FEX to default to the no-backpatching path and provide
> > consistent performance.
>
> I wonder if you can defer backpatching until you've got some idea about
> the likelihood of the access being misaligned? Or is that what you mean
> by "application-specific configuration"?
Not really. If an access ends up being misaligned once, it's basically
guaranteed to keep being misaligned, so the heuristic would do nothing.
Application-specific configuration here means manually inspecting what
a game is doing, seeing whether its code requires TSO semantics in the
general case, and then having a configuration option inside the
emulator to disable TSO emulation for that code.
I have yet to work through the 145,370 games on Steam to quantify
which ones require TSO emulation and which don't, to build an
encompassing set of application profiles.
These application profiles are of course a hack, because we are then
no longer emulating the memory model fully; we're throwing away
correctness on ARM hardware so that users can have a viable gameplay
experience.
>
> > * Correctness:
> >
> > x86 atomic accesses can cross 16-byte (LSE2) granules, but there is no
> > ARM64 instruction that would allow for direct atomic emulation of this.
>
> I'm assuming these are relatively rare and apply only to locked
> instructions so don't, for example, occur in memcpy() as you mentioned
> above?
These are common, and easily occur in hand-rolled memcpy/memset
implementations. With FEAT_LSE2, the FEAT_LRCPC instructions have the
same unaligned fault semantics as the atomic instructions, and they
are used for even trivial memory accesses like `mov rax, [rdx]`.
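A trivial userspace reproducer for anyone with an LSE2 machine who
wants to see the fault (my own test; assumes a compiler targeting
armv8.4-a or later):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	_Alignas(16) unsigned char buf[32] = { 0 };
	uint64_t v;

	/*
	 * Bytes buf+12 .. buf+19 straddle the 16-byte granule, so even
	 * with FEAT_LSE2 this load-acquire takes an alignment fault.
	 */
	asm volatile("ldapr %0, %1"
		     : "=r"(v)
		     : "Q"(*(uint64_t *)(buf + 12))
		     : "memory");

	printf("loaded %llx (only printed if the access didn't fault)\n",
	       (unsigned long long)v);
	return 0;
}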
>
> > As such, a lock must be taken for correctness. While this is easy to
> > emulate in userspace within a process, correct atomic cross-granule
> > operations on shared memory mapped into multiple processes would require
> > a lock shared between all FEX instances which cannot be implemented
> > safely in userspace as is (robust futexes do not work here as they are
> > under the control of the emulated x86 program). Note, this is a less
> > coarse restriction than split locks on x86, which are only concerned
> > with accesses crossing a 64 byte cacheline size.
>
> Even in the kernel, I'm having a hard time to convince myself that a
> global lock gives you correctness here because you're _not_ serialising
> with concurrent aligned accesses that could still be conflicting with
> the faulting accesses if mixed-size concurrency is in use.
>
> I've struggled to come up with an example that goes wrong, but if we
> imagine a 64-bit SWP that straddles a 16-byte boundary and is trying to
> set each 32-bit half of a zero-initialised location to 2:
>
> SWP (2, 2)
>
> This will trap, but another CPU could concurrently execute a pair of
> ordered, aligned 32-bit SWPs on each half, which will not trap:
>
> SWP (1, _) // Set the upper 32-bit word to 1
> DMB ISH
> SWP (_, 1) // Set the lower 32-bit word to 1
>
> My reading of the code is that:
>
> 1. The 64-bit SWP reads upper = 0, lower = 0
> 2. The 64-bit SWP writes upper to 2
> 3. The first 32-bit SWP executes, setting upper to 1 and returning 2
> 4. The second 32-bit SWP executes, setting lower to 1 and returning 0
> 5. The 64-bit SWP fails to write lower as it is no longer zero and so
> returns (0, 1)
>
> In which case, the value in memory is (1, 1) but the old value of (0, 1)
> returned by the 64-bit SWP doesn't make sense in that case.
>
> What did I miss?
This is fine, as the kernel lock will ensure that all memory accesses
that are participating are coherently visible:
1) All unaligned accesses participate in the lock
- If they cross an alignment granule (16B/64B)
2) Aligned accesses never tear themselves
- They may only observe one half of the atomic straddling the granule
- So even if the straddling access "tears" because another thread has
written an aligned half, the same could have happened on real x86
hardware.
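In other words, the serialisation I have in mind is roughly the
following (a hypothetical sketch, not the code in André's patch; error
handling and instruction decoding elided):

#include <linux/mutex.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/*
 * One kernel-side lock shared by every trapped, granule-crossing
 * access, so no two emulated accesses can interleave with each other.
 * Aligned accesses never take it, which is why they may still observe
 * one half of the straddling update - exactly as on real x86.
 */
static DEFINE_MUTEX(unaligned_atomic_lock);

static int emulate_unaligned_swp64(u64 __user *uaddr, u64 new, u64 *old)
{
	int ret = 0;

	mutex_lock(&unaligned_atomic_lock);
	if (get_user(*old, uaddr) || put_user(new, uaddr))
		ret = -EFAULT;
	mutex_unlock(&unaligned_atomic_lock);

	return ret;
}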
>
> I also notice that your patch doesn't implement support for CAS and I
> worry that letting out partial updates from supposedly atomic stores
> could cause additional issues there. A CAS that reports failure could
> actually have mutated memory, which then feeds into the next iteration
> of the CAS loop in userspace.
I believe that's just because this patch is a simplification of the
full patch series that we are shipping downstream.
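The obvious way to handle CAS there is under the same lock, so a
failed compare can never leak a partial store - roughly (again a
hypothetical sketch, not the shipped code):

static int emulate_unaligned_cas64(u64 __user *uaddr, u64 expected,
				   u64 new, u64 *observed)
{
	int ret = 0;

	mutex_lock(&unaligned_atomic_lock);
	if (get_user(*observed, uaddr))
		ret = -EFAULT;
	else if (*observed == expected && put_user(new, uaddr))
		ret = -EFAULT;
	mutex_unlock(&unaligned_atomic_lock);

	return ret;
}

The store only happens once the compare has succeeded, so userspace
never sees a mutated location from a CAS that reported failure.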
>
> Anyway, my lack of understanding aside, this doesn't look like an ABI we
> should be committing to on arm64.
>
> Cheers,
>
> Will