Message-ID: <20251106160735.2638485-1-andrealmeid@igalia.com>
Date: Thu,  6 Nov 2025 13:07:34 -0300
From: André Almeida <andrealmeid@...lia.com>
To: Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will@...nel.org>,
	Mark Rutland <mark.rutland@....com>,
	Mark Brown <broonie@...nel.org>
Cc: linux-arm-kernel@...ts.infradead.org,
	linux-kernel@...r.kernel.org,
	kernel-dev@...lia.com,
	Ryan Houdek <houdek.ryan@...-emu.org>,
	Billy Laws <blaws05@...il.com>,
	André Almeida <andrealmeid@...lia.com>
Subject: [RFC PATCH 0/1] arch: arm64: Implement unaligned atomic emulation

This patch proposes adding kernel-side emulation for unaligned atomic
instructions on ARM64. This is intended for x86 emulators (like FEX)
that struggle to effectively handle such operations in userspace alone.
Such handling is required as x86 permits such unaligned accesses (albeit
sometimes with a performance penalty as in the case of split-locks[1])
but ARM64 does not and will raise a bus error.

User applications that wish to enable this support can use the new
prctl() flag `PR_ARM64_UNALIGN_ATOMIC_EMULATE`. Some optimizations and
instructions are left for future revisions of this patchset.
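
As a rough illustration of how a process might opt in, the sketch below
wraps the prctl() call. It is not taken from the patch: the helper name
set_unaligned_atomic_emulation() is hypothetical, the numeric value of
PR_ARM64_UNALIGN_ATOMIC_EMULATE is passed in by the caller because it
is only defined by the patched include/uapi/linux/prctl.h, and the
meaning of the second argument (nonzero to enable) is an assumption.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * option: numeric value of PR_ARM64_UNALIGN_ATOMIC_EMULATE as defined
 *         by the patched <linux/prctl.h> (not hardcoded here).
 * enable: ASSUMED semantics, nonzero to turn emulation on.
 *
 * Returns 0 on success or a negative errno value. A kernel without
 * the patch is expected to reject the unknown option.
 */
static int set_unaligned_atomic_emulation(int option, unsigned long enable)
{
	if (prctl(option, enable, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}
```

An emulator would presumably call this once at startup and fall back to
its userspace handling whenever a negative value comes back.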

Emulators like FEX attempt to emulate this in userspace, but with
caveats in two areas:

 * Performance:

It should first be noted that due to x86's TSO (total store order)
memory model, ARM64 atomic instructions must be used for all memory
accesses. This results in unaligned loads/stores being much more common
than one would expect and the overhead of emulating them significantly
impacting performance.  For this common case of unaligned loads/stores,
code backpatching is used in FEX to avoid repeated overhead from
handling the same faulting access. This replaces faulting unaligned
atomic ARM64 instructions with regular load/stores and memory barriers.
This comes at the cost of significant performance problems if a
function like memcpy ends up being patched because it is, very
infrequently, used with unaligned memory: every later call then pays
for the barriers. This is severe enough to make games like Mirror's
Edge and Assassin's Creed Origins unplayable without
application-specific configuration.

Microbenchmarks[2] measure a 4x decrease in overhead with kernel-side
handling compared to userspace, and this figure is currently even larger
when FEX is run under Wine. Such a dramatic decrease would make it
reasonable for FEX to default to the no-backpatching path and provide
consistent performance. 

 * Correctness:

x86 atomic accesses can cross 16-byte (LSE2) granules, but there is no
ARM64 instruction that would allow for direct atomic emulation of this.
As such, a lock must be taken for correctness. While this is easy to
emulate in userspace within a process, correct atomic cross-granule
operations on shared memory mapped into multiple processes would require
a lock shared between all FEX instances which cannot be implemented
safely in userspace as is (robust futexes do not work here as they are
under the control of the emulated x86 program). Note that this is a
finer-grained restriction than split locks on x86, which only concern
accesses crossing a 64-byte cache line.
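
To make the granule arithmetic concrete, here is a minimal sketch of
the boundary check, assuming a power-of-two granule and a nonzero
access size. The function name crosses_boundary() is hypothetical and
is not taken from the patch or from FEX.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if the access [addr, addr + size)
 * touches more than one aligned block of `granule` bytes.
 * Assumes granule is a power of two and size > 0.
 */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t granule)
{
	uintptr_t mask = ~(uintptr_t)(granule - 1);
	uintptr_t first = addr & mask;              /* block of first byte */
	uintptr_t last = (addr + size - 1) & mask;  /* block of last byte */

	return first != last;
}
```

For example, an 8-byte access at 0x100c crosses a 16-byte granule
(bytes 0x100c..0x1013) but stays inside one 64-byte cache line, so it
would need the locked path on ARM64 without being a split lock on x86.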

 * Precedent:

Both XNU and NT kernels support unaligned atomic emulation for their
respective x86 emulators. Windows additionally supports 'volatile
metadata', which is emitted by newer versions of MSVC to inform
emulators which specific load/store accesses require atomic handling
[3]. FEX supports this together with an extension mechanism [4] which
can be manually populated to avoid e.g. the aforementioned Assassin's
Creed slowdown.

This implementation is an RFC so we can learn more about how to get this
code upstream and what the maintainers think of such a feature being
merged here. The code is a simplified version of the original work done
by Billy Laws: it accepts only the subset of 64-bit atomic instructions
needed by the benchmark tool[2]. The proposed interface as used by FEX
is at [5].

Thanks!
	André

[1] https://lwn.net/Articles/911219/
[2] https://gitlab.freedesktop.org/freedesktop/snippets/-/snippets/7875
[3] https://learn.microsoft.com/en-us/cpp/build/reference/volatile?view=msvc-170
[4] https://github.com/FEX-Emu/FEX/pull/4773
[5] https://github.com/FEX-Emu/FEX/pull/4985

André Almeida (1):
  arch: arm64: Implement unaligned atomic emulation

 arch/arm64/include/asm/exception.h   |   1 +
 arch/arm64/include/asm/processor.h   |   3 +
 arch/arm64/include/asm/thread_info.h |   1 +
 arch/arm64/kernel/Makefile           |   2 +-
 arch/arm64/kernel/process.c          |  15 +
 arch/arm64/kernel/unaligned_atomic.c | 520 +++++++++++++++++++++++++++
 arch/arm64/mm/fault.c                |  10 +
 include/uapi/linux/prctl.h           |   5 +
 kernel/sys.c                         |   7 +-
 9 files changed, 562 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm64/kernel/unaligned_atomic.c

-- 
2.51.1

