Message-ID: <45155869-1490-49ab-8df1-7ad13f79c09a@linaro.org>
Date: Thu, 20 Feb 2025 09:54:38 -0800
From: Richard Henderson <richard.henderson@...aro.org>
To: "Maciej W. Rozycki" <macro@...am.me.uk>,
Ivan Kokshaysky <ink@...een.parts>, Matt Turner <mattst88@...il.com>
Cc: Arnd Bergmann <arnd@...db.de>,
John Paul Adrian Glaubitz <glaubitz@...sik.fu-berlin.de>,
Magnus Lindholm <linmag7@...il.com>, "Paul E. McKenney"
<paulmck@...nel.org>, Linus Torvalds <torvalds@...ux-foundation.org>,
Al Viro <viro@...iv.linux.org.uk>, linux-alpha@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency
On 2/19/25 04:46, Maciej W. Rozycki wrote:
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15,
Pointer? I can't find it on the gcc-patches list.
> implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
>
> The principle of operation is as follows:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
> aligned whole data quantities spanned being read and stored for the
> reference with a subsequent STx_C operation, along with the width of
> the data accessed and its virtual address, and the task referring or
> NULL if the kernel. The validity marker is set.
>
> 2. Regular memory load operations are used to retrieve data because no
> atomicity is needed at this stage, and matching the width accessed,
> either LDQ_U or LDL even though the latter instruction requires extra
> operations, to avoid the case where an unaligned longword located
> entirely within an aligned quadword would complicate handling.
>
> 3. Data is masked, shifted and merged appropriately and returned in the
> intended register as the result of the trapping LDx_L instruction.
>
> 4. A trapping unaligned STx_C operation results in the validity marker
> being checked for being true, and the width of the data accessed
> along with the virtual address and the task referring or the kernel
> for a match. The pair of whole data quantities previously read by
> LDx_L emulation is retrieved and the validity marker is cleared.
>
> 5. If the checks succeeded, then in an atomic loop the location of the
> first whole data quantity is reread, and data retrieved compared with
> the value previously obtained. If there's no match, then the loop is
> aborted and 0 is returned in the intended register as the result of
> the trapping STx_C instruction and emulation completes. Otherwise
> new data obtained from the source operand of STx_C is combined with
> the data retrieved, replacing by byte insertion the part intended,
> and an atomic write of this new data is attempted. If it fails, the
> loop continues from the beginning. Otherwise processing proceeds to
> the next step.
>
> 6. The same operations are performed on the second whole data quantity.
>
> 7. At this point both whole data quantities have been written, ensuring
> that no third-party intervening write has changed them at the point
> of the write from the values held at previous LDx_L. Therefore 1 is
> returned in the intended register as the result of the trapping STx_C
> instruction.
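For reference, here is how I read steps 1 to 7 for the quadword case, as a user-space C11 model rather than kernel code. Everything named here (`struct ll_record`, `emu_ldq_l`, `emu_stq_c`, `replace_if_unchanged`) is hypothetical and not taken from the patch; the task and width checks of step 4 are left out, and a compare-and-swap stands in for the LDQ_L/STQ_C loop of step 5. It assumes little-endian byte order and a byte offset of 1 to 7 into the first quadword.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record filled in by the LDQ_L emulation (step 1). */
struct ll_record {
    bool valid;              /* the validity marker */
    _Atomic uint64_t *pair;  /* the two adjacent aligned quadwords */
    unsigned off;            /* byte offset into pair[0], 1..7 */
    uint64_t seen[2];        /* values read at LDQ_L time */
};

/* Steps 1-3: read both quadwords, remember them, return the merged value. */
static uint64_t emu_ldq_l(struct ll_record *r, _Atomic uint64_t *pair,
                          unsigned off)
{
    r->pair = pair;
    r->off = off;
    r->seen[0] = atomic_load(&pair[0]);
    r->seen[1] = atomic_load(&pair[1]);
    r->valid = true;
    return (r->seen[0] >> (8 * off)) | (r->seen[1] << (64 - 8 * off));
}

/* Steps 5-6 for one quadword: store repl only while *p still holds seen. */
static bool replace_if_unchanged(_Atomic uint64_t *p, uint64_t seen,
                                 uint64_t repl)
{
    uint64_t expected = seen;

    while (!atomic_compare_exchange_weak(p, &expected, repl)) {
        if (expected != seen)
            return false;    /* someone else wrote in between */
        /* otherwise a spurious failure: retry */
    }
    return true;
}

/* Steps 4-7: returns the 1/0 value STQ_C would leave in its register. */
static int emu_stq_c(struct ll_record *r, _Atomic uint64_t *pair,
                     unsigned off, uint64_t val)
{
    bool ok = r->valid && r->pair == pair && r->off == off;
    uint64_t low_mask, lo, hi;

    r->valid = false;        /* the marker is cleared either way */
    if (!ok)
        return 0;

    low_mask = ~(~UINT64_C(0) << (8 * off));  /* bytes below the datum */
    lo = (r->seen[0] & low_mask) | (val << (8 * off));
    hi = (r->seen[1] & ~low_mask) | (val >> (64 - 8 * off));

    if (!replace_if_unchanged(&pair[0], r->seen[0], lo))
        return 0;
    if (!replace_if_unchanged(&pair[1], r->seen[1], hi))
        return 0;            /* note: pair[0] has already been written */
    return 1;
}
```

In this model the two quadwords are updated one after the other, so a failure on the second leaves the first already modified.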
I think general-purpose non-atomic emulation of STx_C is a really bad idea.
Without looking at your gcc patches, I can guess what you're after: you've generated a
ll/sc sequence for (aligned) short, and want to emulate if it happens to be unaligned.
Crucially, when emulating non-aligned, you should not strive to make it atomic. No other
architecture promises atomic non-aligned stores, so why should you do that here?
I suggest some sort of magic code sequence,
        bic     addr_in, 6, addr_al
loop:
        ldq_l   t0, 0(addr_al)
        magic-nop done - loop
        inswl   data, addr_in, t1
        mskwl   t0, addr_in, t0
        bis     t0, t1, t0
        stq_c   t0, 0(addr_al)
        beq     t0, loop
done:
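For the aligned case, what the loop computes can be modelled in C11 roughly as below. This is only a sketch: `inswl` and `mskwl` here are C stand-ins for the instructions of the same name, `store_short_aligned` is a made-up name, and a compare-and-swap loop stands in for the ldq_l/stq_c pair.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Model of inswl: place the low 16 bits of data at byte (addr & 7). */
static uint64_t inswl(uint64_t data, uint64_t addr)
{
    return (data & 0xffff) << (8 * (addr & 7));
}

/* Model of mskwl: clear the 16-bit field at byte (addr & 7). */
static uint64_t mskwl(uint64_t q, uint64_t addr)
{
    return q & ~(UINT64_C(0xffff) << (8 * (addr & 7)));
}

/* Hypothetical helper: store a short into its enclosing aligned quadword. */
static void store_short_aligned(_Atomic uint64_t *quad, uint64_t addr,
                                uint16_t data)
{
    uint64_t old = atomic_load(quad);
    uint64_t upd;

    do {
        /* mskwl + inswl + bis from the sequence above */
        upd = mskwl(old, addr) | inswl(data, addr);
        /* on failure, old is reloaded with the current contents */
    } while (!atomic_compare_exchange_weak(quad, &old, upd));
}
```

The model does not cover the trap itself: with the bic mask of 6, an odd addr_in leaves bit 0 set in addr_al, so the real ldq_l takes the unaligned trap.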
On the trap, match the magic-nop, pick out the input registers from the following inswl,
perform the two (atomic!) byte stores to accomplish the emulation, and adjust the pc forward
to the done label.
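As a sketch of that handler step, modelled in user-space C11: `struct trap_regs`, `emulate_short_store` and the flat `mem` array are all hypothetical, with register values used as indices into `mem` rather than as real virtual addresses. The two bytes are written separately, each store atomic on its own, low byte first since Alpha is little-endian. On non-BWX hardware each of those byte stores would itself have to be an ll/sc sequence on the enclosing quadword, which this model hides behind `atomic_store`.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical saved register file as seen by the trap handler. */
struct trap_regs {
    uint64_t r[32];
    uint64_t pc;
};

/*
 * ra_data and rb_addr are the register numbers taken from the inswl,
 * loop_pc is the address of the loop label, disp is (done - loop).
 */
static void emulate_short_store(struct trap_regs *regs, _Atomic uint8_t *mem,
                                unsigned ra_data, unsigned rb_addr,
                                uint64_t loop_pc, uint64_t disp)
{
    uint64_t data = regs->r[ra_data];
    uint64_t addr = regs->r[rb_addr];

    /* Two independent byte stores; the pair is deliberately not atomic. */
    atomic_store(&mem[addr], (uint8_t)data);
    atomic_store(&mem[addr + 1], (uint8_t)(data >> 8));

    /* Resume at the done label, skipping the rest of the ll/sc loop. */
    regs->pc = loop_pc + disp;
}
```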
Choose anything you like for the magic-nop. The (done - loop) displacement is small, so
any 8-bit immediate would suffice. E.g. "eqv $31, disp, $31". You might require the
displacement to be constant and not actually extract "disp"; just match the entire
uint32_t instruction pattern.
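Something along these lines for the matching, as a sketch only. The operate-format field layout and the opcode/function values (eqv as 0x11/0x48, inswl as 0x12/0x1b) are as I recall them from the Alpha Architecture Handbook and should be checked against it. `MAGIC_DISP`, `MAGIC_NOP` and `match_magic_sequence` are made-up names, and the value 28 assumes the displacement is counted in bytes over the seven instructions between loop and done.

```c
#include <stdbool.h>
#include <stdint.h>

/* Operate-format fields (layout assumed, see above). */
#define INSN_OPCODE(i)  (((i) >> 26) & 0x3f)
#define INSN_RA(i)      (((i) >> 21) & 0x1f)
#define INSN_RB(i)      (((i) >> 16) & 0x1f)
#define INSN_IS_LIT(i)  (((i) >> 12) & 1)
#define INSN_FUNC(i)    (((i) >> 5) & 0x7f)

/* Encode "op ra, #lit, rc" (operate format with an 8-bit literal). */
static uint32_t enc_operate_lit(uint32_t op, uint32_t ra, uint32_t lit,
                                uint32_t func, uint32_t rc)
{
    return op << 26 | ra << 21 | (lit & 0xff) << 13 | 1u << 12 |
           func << 5 | rc;
}

/* Encode "op ra, rb, rc" (operate format with a register operand). */
static uint32_t enc_operate_reg(uint32_t op, uint32_t ra, uint32_t rb,
                                uint32_t func, uint32_t rc)
{
    return op << 26 | ra << 21 | rb << 16 | func << 5 | rc;
}

#define MAGIC_DISP 28u  /* hypothetical: done - loop, in bytes */
/* "eqv $31, MAGIC_DISP, $31": result discarded, so effectively a nop. */
#define MAGIC_NOP  enc_operate_lit(0x11, 31, MAGIC_DISP, 0x48, 31)

/*
 * insn points at the instruction following the trapping ldq_l.
 * On a match, report which registers the inswl names as data and address.
 */
static bool match_magic_sequence(const uint32_t *insn,
                                 unsigned *ra_data, unsigned *rb_addr)
{
    uint32_t ins;

    if (insn[0] != MAGIC_NOP)   /* whole-word compare, disp not extracted */
        return false;

    ins = insn[1];
    if (INSN_OPCODE(ins) != 0x12 || INSN_FUNC(ins) != 0x1b ||
        INSN_IS_LIT(ins))
        return false;           /* not "inswl ra, rb, rc" */

    *ra_data = INSN_RA(ins);
    *rb_addr = INSN_RB(ins);
    return true;
}
```

Comparing the whole word keeps the handler from having to trust a displacement decoded from user code.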
r~