linux-kernel - Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx

Message-ID: <alpine.DEB.2.21.2504092019200.18515@angie.orcam.me.uk>
Date: Wed, 9 Apr 2025 21:59:59 +0100 (BST)
From: "Maciej W. Rozycki" <macro@...am.me.uk>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
cc: Arnd Bergmann <arnd@...db.de>, 
    Linus Torvalds <torvalds@...ux-foundation.org>, 
    Richard Henderson <richard.henderson@...aro.org>, 
    Ivan Kokshaysky <ink@...een.parts>, Matt Turner <mattst88@...il.com>, 
    John Paul Adrian Glaubitz <glaubitz@...sik.fu-berlin.de>, 
    Magnus Lindholm <linmag7@...il.com>, 
    "Paul E. McKenney" <paulmck@...nel.org>, 
    Alexander Viro <viro@...iv.linux.org.uk>, linux-alpha@...r.kernel.org, 
    linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data
 consistency

On Wed, 9 Apr 2025, Eric W. Biederman wrote:

> >> So unless you actually *see* the unaligned faults, I really think you
> >> shouldn't emulate them.
> >>
> >> And I'd like to know where they are if you do see them
> 
> I was nerd sniped by this so I took a look.
> 
> I have a distinct memory that even the ipv4 stack can generate unaligned
> loads.  Looking at the code in net/ipv4/ip_input.c:ip_rcv_finish_core
> there are several unprotected accesses to iph->daddr.
> 
> Which means that if the lower layers ever give something that is not 4
> byte aligned for ipv4 just reading the destination address will be an
> unaligned read.
> 
> There are similar unprotected accesses to the ipv6 destination address
> but it is declared as an array of bytes.  So that address can not
> be misaligned.
> 
> There is a theoretical path through 802.2 that adds a 3 byte sap
> header that could cause problems.  We have LLC_SAP_IP defined
> but I don't see anything calling register_8022_client that would
> be needed to hook that up to the ipv4 stack.
> 
> As long as the individual ethernet drivers have the hardware deliver
> packets 2 bytes into an aligned packet buffer the 14 byte ethernet
> header will end on a 16 byte aligned location, I don't think there
> is a way to trigger unaligned behavior with ipv4 or ipv6.
> 
> Hmm.  Looking appletalk appears to be built on top of SNAP.
> So after the ethernet header processing the code goes through
> net/llc/llc_input.c:llc_rcv and then net/802/snap_rcv before
> reaching any of the appletalk protocols.
> 
> I think the common case for llc would be 3 bytes + 5 bytes for snap,
> for 8 bytes in the common case.  But the code seems to be reading
> 4 or 5 bytes for llc so I am confused.  In either case it definitely
> appears there are cases where the ethernet headers before appletalk
> can be an odd number of bytes which has the possibility of unaligning
> everything.
> 
> Both of the appletalk protocols appear to make unguarded 16bit reads
> from their headers.  So having a buffer that is only 1 byte aligned
> looks like it will definitely be a problem.

 Thank you for your analysis, really insightful.

> > FWIW, all the major architectures that have variants without
> > unaligned load/store (arm32, mips, ppc, riscv) trap and emulate
> > them for both user and kernel access for normal memory, but
> > they don't emulate it for atomic ll/sc type instructions.
> > These instructions also trap and kill the task on the
> > architectures that can do hardware unaligned access (x86
> > cmpxchg8b being a notable exception).

 But all those architectures have 1-byte and 2-byte memory access machine 
instructions as well, and consequently none requires an RMW sequence to 
update such data quantities, so none has the data consistency issue that 
we have on non-BWX Alpha.

> I don't see anything that would get atomics involved in the networking
> stack.  No READ_ONCE on packet data or anything like that.  I believe
> that is fairly fundamental as well.  Whatever is processing a packet is
> the only code processing that packet.
> 
> So I would be very surprised if the kernel needed emulation of any
> atomics, just emulation of normal unaligned reads.  I haven't looked to
> see if the transmission paths do things that will result in unaligned
> writes.

 The problem we have on the non-BWX Alpha target is that hardware has no 
memory access instructions narrower than 4 bytes.  Consequently, to write 
a 1- or 2-byte quantity an RMW instruction sequence is required: read the 
whole aligned 4-byte quantity, insert the bytes to be modified, and write 
the whole 4-byte quantity back to memory.  However such a sequence is not 
safe for concurrent writes, as described below.
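
 As an illustration only, here is roughly what such an RMW sequence for a 
1-byte store amounts to, written in C rather than Alpha assembly.  The 
function name and the representation of memory as an array of 32-bit 
words are hypothetical, and little-endian byte numbering is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: store one byte using only aligned 4-byte accesses.
   `words' models memory as aligned 32-bit quantities (an assumption made
   for illustration); byte lanes are numbered little-endian. */
static void store_byte_rmw(uint32_t *words, size_t byte_off, uint8_t value)
{
    assert(words != NULL);

    uint32_t *word = &words[byte_off / 4];          /* containing word */
    unsigned shift = (unsigned)(byte_off % 4) * 8;  /* byte lane */

    uint32_t tmp = *word;                   /* read the whole word */
    tmp &= ~((uint32_t)0xff << shift);      /* clear the target byte */
    tmp |= (uint32_t)value << shift;        /* insert the new byte */
    *word = tmp;                            /* write the whole word back */
}
```

 Nothing protects the window between the read and the write-back, which 
is what the following paragraphs are about.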

 A pair of concurrent RMW sequences targeting the same part of an aligned 
4-byte data quantity is not an issue: it's just an execution race and 
software may be prepared for it (or else either prevent the race with a 
mutex or use an atomic data type along with the associated accessors, 
which will move data locations in memory suitably apart).

 The issue is a pair of concurrent RMW sequences targeting different 
parts of the same aligned 4-byte data quantity: software can legitimately 
expect that writes to disjoint memory locations (e.g. adjacent struct 
members, except for bit-fields) won't affect each other.  But where a 
pair of such RMW sequences runs interleaved, the later write to one 
location will clobber the value written previously to the other.  So we 
have a data race.  Note that atomicity is not the concern here: we are 
talking about plain memory writes, such as with ordinary assignments to 
regular variables in C code.
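
 To make the clobbering concrete, here is a single-threaded C sketch that 
replays one possible interleaving by hand.  All names are hypothetical 
and the two "CPUs" are merely two records of an in-flight RMW sequence; 
it is a model of the problem, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One in-flight RMW byte store, split into its read and write halves
   so that two of them can be interleaved deterministically. */
struct rmw {
    uint32_t tmp;       /* word value seen at read time */
    unsigned shift;     /* bit position of the target byte */
    uint8_t value;      /* byte to be stored */
};

static void rmw_read(struct rmw *r, const uint32_t *word,
                     unsigned lane, uint8_t value)
{
    assert(lane < 4);
    r->shift = lane * 8;
    r->value = value;
    r->tmp = *word;     /* read the whole aligned word */
}

static void rmw_write(const struct rmw *r, uint32_t *word)
{
    /* Insert the byte into the stale copy and write it all back. */
    *word = (r->tmp & ~((uint32_t)0xff << r->shift))
            | ((uint32_t)r->value << r->shift);
}

static uint32_t interleaved_demo(void)
{
    uint32_t word = 0;
    struct rmw a, b;

    rmw_read(&a, &word, 0, 0x11);   /* CPU A reads the word */
    rmw_read(&b, &word, 1, 0x22);   /* CPU B reads it too */
    rmw_write(&a, &word);           /* A stores byte 0 */
    rmw_write(&b, &word);           /* B writes back its stale copy */
    return word;                    /* byte 0 written by A is gone */
}
```

 Here interleaved_demo() returns 0x00002200 rather than 0x00002211, even 
though the two stores were to different bytes.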

 So I have come up with a solution where such RMW sequences are actually 
emitted by GCC as an LDL_L/STL_C atomic access loop which ensures that no 
intervening write has changed the aligned 4-byte data quantity containing 
the 1- or 2-byte quantity accessed.  This guarantees consistency of the 
part(s) of the aligned 4-byte data quantity *outside* the 1- or 2-byte 
quantity written.  Atomicity is guaranteed by hardware as a side effect, 
but not a part of this Alpha/Linux psABI extension (i.e. not in our 
contract).
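
 A rough C model of such a loop follows.  It is a sketch under stated 
assumptions: a C11 compare-and-exchange stands in for the LDL_L/STL_C 
pair (it detects a changed value rather than an intervening write as 
such), the function name is hypothetical, and it is not what GCC 
actually emits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: store one byte without disturbing its neighbours
   in the containing aligned 32-bit word.  The retry loop plays the role
   of the LDL_L/STL_C loop. */
static void store_byte_llsc(_Atomic uint32_t *words, size_t byte_off,
                            uint8_t value)
{
    assert(words != NULL);

    _Atomic uint32_t *word = &words[byte_off / 4];
    unsigned shift = (unsigned)(byte_off % 4) * 8;

    /* Stand-in for LDL_L: fetch the whole aligned word. */
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    uint32_t merged;

    do {
        /* Insert the byte into the copy just observed. */
        merged = (old & ~((uint32_t)0xff << shift))
                 | ((uint32_t)value << shift);
        /* Stand-in for STL_C: store only if the word is unchanged; on
           failure `old' is refreshed and the insertion is redone. */
    } while (!atomic_compare_exchange_weak_explicit(word, &old, merged,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
}
```

 The relaxed ordering is deliberate in this model, as the point is 
consistency of the neighbouring bytes and not ordering against other 
memory accesses.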

 For known-unaligned 2-byte quantities (such as packed structure members) 
the compiler knows that they may span 2 aligned 4-byte data quantities and 
produces two LDL_L/STL_C loops with suitable address adjustments and data 
masking.  This still guarantees consistency of data *outside* the 2-byte 
quantity written.  No atomicity is guaranteed, because parts of the 2-byte 
quantity may be stored in pieces (if the 2-byte quantity is in the middle 
of an aligned 4-byte quantity, then it'll be written twice).
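
 The two-loop arrangement could be modelled in C along these lines.  
Again this is only a sketch: the helper names are hypothetical, a C11 
compare-and-exchange stands in for LDL_L/STL_C, and the way the masks are 
derived here is my illustration rather than the exact code the compiler 
produces.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* One loop: replace the bits selected by `mask' in one aligned word. */
static void merge_word(_Atomic uint32_t *word, uint32_t mask, uint32_t bits)
{
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    uint32_t merged;

    do {
        merged = (old & ~mask) | (bits & mask);
    } while (!atomic_compare_exchange_weak_explicit(word, &old, merged,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
}

/* Hypothetical sketch: store a 2-byte quantity at any byte offset using
   two loops, one per aligned word that may be touched (little-endian). */
static void store_half_unaligned(_Atomic uint32_t *words, size_t byte_off,
                                 uint16_t value)
{
    assert(words != NULL);

    unsigned shift = (unsigned)(byte_off % 4) * 8;
    uint64_t mask = (uint64_t)0xffff << shift;  /* may spill into word 2 */
    uint64_t bits = (uint64_t)value << shift;
    size_t first = byte_off / 4;        /* word holding the low byte */
    size_t last = (byte_off + 1) / 4;   /* word holding the high byte */

    merge_word(&words[first], (uint32_t)mask, (uint32_t)bits);
    if (last == first)
        /* Both bytes live in one word, which is simply written twice. */
        merge_word(&words[last], (uint32_t)mask, (uint32_t)bits);
    else
        merge_word(&words[last], (uint32_t)(mask >> 32),
                   (uint32_t)(bits >> 32));
}
```

 Each loop is consistent on its own word, but a reader may observe the 
state between the two loops, which is why no atomicity of the 2-byte 
quantity can be promised.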

 The problem is with the case where the compiler has been told to produce 
code to write an aligned 2-byte quantity, but at run time it turns out 
unaligned.  Now we have to emulate the LDL_L and STL_C instructions of the 
atomic access loop or otherwise the code will crash.

 My approach for this scenario is simple.  LDL_L emulation remembers the 
address accessed and the data present in the 2 aligned 4-byte data 
quantities spanned.  STL_C emulation returns failure in the case of an 
address mismatch.  Otherwise it uses two LDL_L/STL_C loops to load the 2 
aligned 4-byte data quantities piece by piece, compares each with the data 
retrieved previously at LDL_L emulation time, returning failure in the 
case of a mismatch, inserts the requested value and then stores the 
resulting quantity.  Again this guarantees consistency of the parts of the 
2 aligned 4-byte data quantities *outside* the unaligned quantity written.  
And again, no atomicity is guaranteed.
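
 In C terms the emulation logic could be sketched as below.  This is a 
user-space model and not the trap handler from the patch: the structure, 
the function names and the word-array view of memory are hypothetical, 
and a strong compare-and-exchange stands in for each inner LDL_L/STL_C 
loop including its comparison against the remembered data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What the LDL_L emulation remembers for the following STL_C. */
struct llsc_state {
    bool valid;
    size_t byte_off;    /* unaligned address accessed */
    uint32_t lo, hi;    /* the 2 aligned words spanned, as seen then */
};

static uint32_t emulate_ldl_l(struct llsc_state *st,
                              _Atomic uint32_t *words, size_t byte_off)
{
    assert(byte_off % 4 != 0);      /* only the unaligned case traps */
    st->byte_off = byte_off;
    st->lo = atomic_load(&words[byte_off / 4]);
    st->hi = atomic_load(&words[byte_off / 4 + 1]);
    st->valid = true;
    /* Assemble the unaligned 4-byte value (little-endian). */
    uint64_t both = ((uint64_t)st->hi << 32) | st->lo;
    return (uint32_t)(both >> ((byte_off % 4) * 8));
}

/* Store the masked bits only if the word still equals the snapshot. */
static bool replace_if_unchanged(_Atomic uint32_t *word, uint32_t seen,
                                 uint32_t mask, uint32_t bits)
{
    uint32_t merged = (seen & ~mask) | (bits & mask);
    return atomic_compare_exchange_strong(word, &seen, merged);
}

static bool emulate_stl_c(struct llsc_state *st, _Atomic uint32_t *words,
                          size_t byte_off, uint32_t value)
{
    bool ok = st->valid && st->byte_off == byte_off;
    st->valid = false;              /* a reservation is used only once */
    if (!ok)
        return false;               /* address mismatch: report failure */

    unsigned shift = (unsigned)(byte_off % 4) * 8;
    uint64_t mask = (uint64_t)0xffffffff << shift;
    uint64_t bits = (uint64_t)value << shift;
    size_t idx = byte_off / 4;

    if (!replace_if_unchanged(&words[idx], st->lo,
                              (uint32_t)mask, (uint32_t)bits))
        return false;               /* first word changed meanwhile */
    /* If this one fails the first piece has already been stored. */
    return replace_if_unchanged(&words[idx + 1], st->hi,
                                (uint32_t)(mask >> 32),
                                (uint32_t)(bits >> 32));
}
```

 On failure the caller is expected to retry from the LDL_L, exactly as 
the compiler-generated loop does with the real instructions.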

 So while there are no atomic operations in our code at the C language 
level, we get them sneaked in by the compiler under our feet to solve the 
data consistency issue.  Now if we can ascertain the code paths concerned 
won't ever exercise concurrency, we could tell the compiler not to produce 
these atomics for 1-byte and 2-byte accesses, on a file-by-file or even 
function-by-function basis, but it seems to me like the very maintenance 
effort we want to avoid for a legacy platform.  Whereas if we build the 
kernel with the atomics enabled universally, we won't have to be bothered 
with analysing individual cases (at performance cost, but that's assumed).

 For clarity I've left 8-byte data quantities out of the consideration 
above; they're used by the compiler as suitable and handled accordingly.

 Let me know if you find anything here unclear.

  Maciej