Message-ID: <874iywtywh.fsf@email.froward.int.ebiederm.org>
Date: Wed, 09 Apr 2025 23:37:34 -0500
From: "Eric W. Biederman" <ebiederm@...ssion.com>
To: "Maciej W. Rozycki" <macro@...am.me.uk>
Cc: Arnd Bergmann <arnd@...db.de>,  Linus Torvalds
 <torvalds@...ux-foundation.org>,  Richard Henderson
 <richard.henderson@...aro.org>,  Ivan Kokshaysky <ink@...een.parts>,  Matt
 Turner <mattst88@...il.com>,  John Paul Adrian Glaubitz
 <glaubitz@...sik.fu-berlin.de>,  Magnus Lindholm <linmag7@...il.com>,
  "Paul E. McKenney" <paulmck@...nel.org>,  Alexander Viro
 <viro@...iv.linux.org.uk>,  linux-alpha@...r.kernel.org,
  linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency

"Maciej W. Rozycki" <macro@...am.me.uk> writes:

> On Wed, 9 Apr 2025, Eric W. Biederman wrote:
>
>> >> So unless you actually *see* the unaligned faults, I really think you
>> >> shouldn't emulate them.
>> >>
>> >> And I'd like to know where they are if you do see them
>> 
>> I was nerd sniped by this so I took a look.
>> 
>> I have a distinct memory that even the ipv4 stack can generate unaligned
>> loads.  Looking at the code in net/ipv4/ip_input.c:ip_rcv_finish_core
>> there are several unprotected accesses to iph->daddr.
>> 
>> Which means that if the lower layers ever give something that is not 4
>> byte aligned for ipv4 just reading the destination address will be an
>> unaligned read.
>> 
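
A minimal sketch of what a protected access could look like, assuming
nothing beyond standard C.  The helper name load_u32_unaligned is
hypothetical and not a kernel API; in the kernel the get_unaligned()
accessor family serves this purpose.  Copying through memcpy avoids
promising the compiler that the pointer is 4-byte aligned:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel API): load a 32-bit value from a
 * pointer that may not be 4-byte aligned.  memcpy makes no alignment
 * assumption, so the compiler emits whatever access sequence is safe
 * for the target instead of a single aligned 32-bit load.
 */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}
```

A direct dereference such as iph->daddr, by contrast, lets the compiler
assume natural alignment of the field.
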
>> There are similar unprotected accesses to the ipv6 destination address
>> but it is declared as an array of bytes.  So that address can not
>> be misaligned.
>> 
>> There is a theoretical path through 802.2 that adds a 3 byte sap
>> header that could cause problems.  We have LLC_SAP_IP defined
>> but I don't see anything calling register_8022_client that would
>> be needed to hook that up to the ipv4 stack.
>> 
>> As long as the individual ethernet drivers have the hardware deliver
>> packets 2 bytes into an aligned packet buffer the 14 byte ethernet
>> header will end on a 16 byte aligned location, I don't think there
>> is a way to trigger unaligned behavior with ipv4 or ipv6.
>> 
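
The offset arithmetic above can be sketched as follows.  The constant
and function names are local to this sketch, and the 3-byte and 8-byte
extra header lengths are just the LLC and LLC + SNAP sizes mentioned in
this thread:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants; the names are local to this sketch. */
enum {
    RX_PAD      = 2,   /* bytes skipped at the start of an aligned buffer */
    ETH_HDR_LEN = 14   /* destination MAC + source MAC + EtherType */
};

/*
 * Offset of the network-layer header from the aligned start of the
 * buffer, given extra link-layer header bytes (e.g. LLC or LLC + SNAP).
 */
static size_t l3_offset(size_t extra_hdr_bytes)
{
    return RX_PAD + ETH_HDR_LEN + extra_hdr_bytes;
}

/* Nonzero if `offset` is a multiple of `align`. */
static int is_aligned(size_t offset, size_t align)
{
    assert(align != 0);
    return offset % align == 0;
}
```

With no extra header the offset is 2 + 14 = 16, which is 4-byte
aligned.  An odd-length extra header, such as 3 bytes, gives 19, which
is not even 2-byte aligned.
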
>> Hmm.  Looking at it, appletalk appears to be built on top of SNAP.
>> So after the ethernet header processing the code goes through
>> net/llc/llc_input.c:llc_rcv and then net/802/snap_rcv before
>> reaching any of the appletalk protocols.
>> 
>> I think the common case for llc would be 3 bytes + 5 bytes for snap,
>> for 8 bytes in the common case.  But the code seems to be reading
>> 4 or 5 bytes for llc so I am confused.  In either case it definitely
>> appears there are cases where the ethernet headers before appletalk
>> can be an odd number of bytes which has the possibility of unaligning
>> everything.
>> 
>> Both of the appletalk protocols appear to make unguarded 16bit reads
>> from their headers.  So having a buffer that is only 1 byte aligned
>> looks like it will definitely be a problem.
>
>  Thank you for your analysis, really insightful.
>
>> > FWIW, all the major architectures that have variants without
>> > unaligned load/store (arm32, mips, ppc, riscv) trap and emulate
>> > them for both user and kernel access for normal memory, but
>> > they don't emulate it for atomic ll/sc type instructions.
>> > These instructions also trap and kill the task on the
>> > architectures that can do hardware unaligned access (x86
>> > cmpxchg8b being a notable exception).
>
>  But all those architectures have 1-byte and 2-byte memory access machine 
> instructions as well, and consequently none requires an RMW sequence to 
> update such data quantities that implies the data consistency issue that 
> we have on non-BWX Alpha.
>
>> I don't see anything that would get atomics involved in the networking
>> stack.  No READ_ONCE on packet data or anything like that.  I believe
>> that is fairly fundamental as well.  Whatever is processing a packet is
>> the only code processing that packet.
>> 
>> So I would be very surprised if the kernel needed emulation of any
>> atomics, just emulation of normal unaligned reads.  I haven't looked to
>> see if the transmission paths do things that will result in unaligned
>> writes.
>
>  The problem we have on the non-BWX Alpha target is that hardware has no 
> memory access instructions narrower than 4 bytes.  Consequently to write a 
> 1- or 2-byte quantity an RMW instruction sequence is required, in the way 
> of reading the whole 4-byte quantity, inserting the bytes to be modified, 
> and writing the whole 4-byte quantity back to memory.  However such a 
> sequence is not safe for concurrent writes, as described below.
>
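
A rough C model of the RMW sequence described above, as a sketch only.
The function names are hypothetical, and byte lanes are numbered from
the least significant byte, matching little-endian Alpha:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return `word` with byte lane `lane` (0..3, lane 0 being the least
 * significant byte) replaced by `value`.
 */
static uint32_t insert_byte(uint32_t word, unsigned lane, uint8_t value)
{
    assert(lane < 4);
    unsigned shift = 8u * lane;
    uint32_t mask = (uint32_t)0xffu << shift;

    return (word & ~mask) | ((uint32_t)value << shift);
}

/* Plain, non-atomic read-modify-write of one byte within a word. */
static void store_byte_rmw(uint32_t *word, unsigned lane, uint8_t value)
{
    uint32_t old = *word;                              /* read whole word  */
    uint32_t updated = insert_byte(old, lane, value);  /* insert the byte  */

    *word = updated;                                   /* write whole word */
}
```

Nothing between the read and the write stops another CPU from changing
the other three bytes of the word.
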
>  A pair of concurrent RMW sequences targeting the same part of an aligned 
> 4-byte data quantity is not an issue: it's just an execution race and 
> software may be prepared for it (or otherwise either prevent the race via 
> a mutex or alternatively use an atomic data type along with the associated 
> accessors, which will move data locations in memory suitably apart).
>
>  The issue is a pair of concurrent RMW sequences targeting different 
> parts of the same aligned 4-byte data quantity: software can legitimately 
> expect that writes to disjoint memory locations (e.g. adjacent struct 
> members, except for bit-fields) won't affect each other.  But here where a 
> pair of such RMW sequences runs interleaved, the later write to one 
> location will clobber the value written previously to the other.  So we 
> have a data race.  Note that no atomicity is concerned here, we are 
> talking plain memory writes, such as with ordinary assignments to regular 
> variables in C code.
>
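
The interleaving described above can be replayed deterministically on a
single thread.  This is only an illustration; the two "CPUs" are plain
local variables and all names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Return `word` with byte lane `lane` (0 = least significant) replaced. */
static uint32_t put_byte(uint32_t word, unsigned lane, uint8_t value)
{
    assert(lane < 4);
    unsigned shift = 8u * lane;

    return (word & ~((uint32_t)0xffu << shift)) | ((uint32_t)value << shift);
}

/*
 * Replay two interleaved RMW sequences on the same aligned word:
 * "CPU A" stores 0xAA to lane 0, "CPU B" stores 0xBB to lane 1.
 * Both read before either writes, so B's write-back carries the stale
 * contents of lane 0.
 */
static uint32_t interleaved_byte_stores(uint32_t word)
{
    uint32_t seen_a = word;            /* A reads the word            */
    uint32_t seen_b = word;            /* B reads the same old value  */

    word = put_byte(seen_a, 0, 0xAA);  /* A writes the word back      */
    word = put_byte(seen_b, 1, 0xBB);  /* B writes back, undoing A    */
    return word;
}
```

Starting from 0, the result is 0x0000BB00 rather than 0x0000BBAA: the
store to lane 0 is lost even though the two writers touched disjoint
bytes.
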
>  So I have come up with a solution where such RMW sequences are actually 
> emitted by GCC as an LDL_L/STL_C atomic access loop which ensures that no 
> intervening write has changed the aligned 4-byte data quantity containing 
> the 1- or 2-byte quantity accessed.  This guarantees consistency of the 
> part(s) of the aligned 4-byte data quantity *outside* the 1- or 2-byte 
> quantity written.  Atomicity is guaranteed by hardware as a side effect, 
> but not a part of this Alpha/Linux psABI extension (i.e. not in our 
> contract).
>
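
A sketch of the shape of such a loop in portable C11.  This is not the
code GCC emits: a compare-exchange loop stands in for LDL_L/STL_C here,
which is an assumption of this sketch (compare-exchange only notices a
changed value, while STL_C fails on any intervening write).  The
function name is hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Store one byte into lane `lane` (0 = least significant) of an aligned
 * 32-bit word, retrying whenever the word changed between the load and
 * the store.  The other three bytes therefore always reflect the most
 * recent contents of the word.
 */
static void store_byte_consistent(_Atomic uint32_t *word, unsigned lane,
                                  uint8_t value)
{
    assert(lane < 4);
    unsigned shift = 8u * lane;
    uint32_t mask = (uint32_t)0xffu << shift;
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    uint32_t updated;

    do {
        updated = (old & ~mask) | ((uint32_t)value << shift);
        /* On failure `old` is refreshed with the current contents. */
    } while (!atomic_compare_exchange_weak_explicit(word, &old, updated,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
}
```

Relaxed ordering is used because the goal is only consistency of the
neighbouring bytes, not any ordering guarantee.
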
>  For known-unaligned 2-byte quantities (such as packed structure members) 
> the compiler knows that they may span 2 aligned 4-byte data quantities and 
> produces two LDL_L/STL_C loops with suitable address adjustments and data 
> masking.  This still guarantees consistency of data *outside* the 2-byte 
> quantity written.  No atomicity is guaranteed, because parts of the 2-byte 
> quantity may be stored by pieces (if the 2-byte quantity is in the middle 
> of an aligned 4-byte quantity, then it'll be written twice).
>
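
One way to picture the two-loop scheme, again with C11 compare-exchange
standing in for LDL_L/STL_C.  Splitting the value into one byte per
loop is a simplification chosen for this sketch, not a description of
the masking GCC actually generates, and all names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One consistency loop: replace the bits selected by `mask`, retrying
 * if another writer changed the word in between.
 */
static void masked_store(_Atomic uint32_t *word, uint32_t mask, uint32_t bits)
{
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);

    while (!atomic_compare_exchange_weak_explicit(word, &old,
                                                  (old & ~mask) | (bits & mask),
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ;  /* `old` was refreshed; recompute and retry */
}

/*
 * Store a 2-byte value at an arbitrary byte address within an array of
 * aligned 32-bit words (little-endian lane order).  Two loops always
 * run, one per byte; they hit the same word unless the value straddles
 * a word boundary.
 */
static void store_u16_unaligned(_Atomic uint32_t *mem, size_t byte_addr,
                                uint16_t value)
{
    for (size_t i = 0; i < 2; i++) {
        size_t addr = byte_addr + i;
        unsigned shift = 8u * (unsigned)(addr % 4);
        uint32_t piece = (uint32_t)((value >> (8 * i)) & 0xffu);

        masked_store(&mem[addr / 4], (uint32_t)0xffu << shift, piece << shift);
    }
}
```

Bytes outside the 2-byte quantity are never overwritten with stale
data, but a concurrent reader can observe the value half-written.
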
>  The problem is with the case where the compiler has been told to produce 
> code to write an aligned 2-byte quantity, but at run time it turns out 
> unaligned.  Now we have to emulate the LDL_L and STL_C instructions of the 
> atomic access loop or otherwise the code will crash.
>
>  My approach for this scenario is simple: LDL_L emulation remembers the 
> address accessed and data present in the 2 aligned 4-byte data quantities 
> spanned, and STL_C emulation returns failure in the case of an address 
> mismatch and otherwise uses two LDL_L/STL_C loops to load the 2 
> aligned 4-byte data quantities by piece, compare each with data retrieved 
> previously at LDL_L emulation time, returning failure in the case of a 
> mismatch, insert the requested value and then store the resulting 
> quantity.  Again this guarantees consistency of the parts of the 2 aligned 
> 4-byte data quantities *outside* the unaligned 2-byte quantity written.  
> And again, no atomicity is guaranteed.
>
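
A user-space sketch of one reading of this scheme, not the patch
itself.  It assumes the emulated access is a 32-bit access at a byte
address that is not a multiple of 4, models memory as an array of
aligned words in little-endian lane order, and uses C11
compare-exchange in place of the real LDL_L/STL_C loops.  struct
ll_state and both function names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state recorded by the load-locked emulation. */
struct ll_state {
    size_t addr;      /* byte address of the emulated load          */
    uint32_t lo, hi;  /* the two aligned words as seen at that time */
    int valid;
};

/* Emulated load-locked of 32 bits at an unaligned byte address. */
static uint32_t emul_ldl_l(struct ll_state *st, _Atomic uint32_t *mem,
                           size_t addr)
{
    unsigned sh = 8u * (unsigned)(addr % 4);

    assert(sh != 0);
    st->addr = addr;
    st->lo = atomic_load(&mem[addr / 4]);
    st->hi = atomic_load(&mem[addr / 4 + 1]);
    st->valid = 1;
    return (st->lo >> sh) | (st->hi << (32 - sh));
}

/* Emulated store-conditional: returns 1 on success, 0 on failure. */
static int emul_stl_c(struct ll_state *st, _Atomic uint32_t *mem,
                      size_t addr, uint32_t value)
{
    unsigned sh = 8u * (unsigned)(addr % 4);

    assert(sh != 0);
    if (!st->valid || st->addr != addr)
        return 0;                       /* no matching load-locked */
    st->valid = 0;

    uint32_t keep_hi = 0xffffffffu << sh;  /* bits of `hi` left untouched */
    uint32_t lo = st->lo, hi = st->hi;
    uint32_t new_lo = (lo & ~keep_hi) | (value << sh);
    uint32_t new_hi = (hi & keep_hi) | (value >> (32 - sh));

    /* Each word is stored only if it still holds the remembered data. */
    if (!atomic_compare_exchange_strong(&mem[addr / 4], &lo, new_lo))
        return 0;
    if (!atomic_compare_exchange_strong(&mem[addr / 4 + 1], &hi, new_hi))
        return 0;                       /* low word already written */
    return 1;
}
```

The second failure path leaves the low word updated, which is the lack
of atomicity noted above; bytes outside the written span are still
never replaced with stale data.
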
>  So while there are no atomic operations in our code at the C language 
> level, we get them sneaked in by the compiler under our feet to solve the 
> data consistency issue.  Now if we can ascertain the code paths concerned 
> won't ever exercise concurrency, we could tell the compiler not to produce 
> these atomics for 1-byte and 2-byte accesses, on a file-by-file or even 
> function-by-function basis, but it seems to me like the very maintenance 
> effort we want to avoid for a legacy platform.  Whereas if we build the 
> kernel with the atomics enabled universally, we won't have to be bothered 
> with analysing individual cases (at performance cost, but that's assumed).
>
>  For clarity I've left 8-byte data quantities out of the consideration 
> above; they're used by the compiler as suitable and handled accordingly.
>
>  Let me know if you find anything here unclear.

The emulation you are doing makes sense.

Just a few more points.  I am not current but I have never seen
concurrency (inside of a packet) at the network layer.

I don't recall ever hearing that the write paths in the network stack
were a problem.

I suspect the write side you can verify fairly easily by simply
compiling in appletalk and opening a PF_APPLETALK socket, and sending a
message.  If that doesn't trigger emulation I can't imagine any other
write path in the kernel will.

Eric
