Message-ID: <CAHk-=wg1qQLWKPyvxxZnXwboT48--LKJuCJjF8pHdHRxv0U7wQ@mail.gmail.com>
Date: Mon, 9 Jun 2025 09:38:04 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Uros Bizjak <ubizjak@...il.com>
Cc: David Laight <david.laight.linux@...il.com>, Peter Zijlstra <peterz@...radead.org>, 
	Mateusz Guzik <mjguzik@...il.com>, mingo@...hat.com, x86@...nel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for
 inlined ops

On Sun, 8 Jun 2025 at 23:04, Uros Bizjak <ubizjak@...il.com> wrote:
>
> Please note that the instruction is "rep movsQ", it moves 64bit
> quantities. The alignment is needed to align data to the 64-bit
> boundary.

On real code, the cost of aligning things can be worse than just doing
the copy or clear unaligned, and that's particularly true when inlining
things.

When you clear 8 bytes on x86, you don't align things. You just write
a single 'movq'.
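
Something like this (made-up example) compiles to a single store plus a
ret - no alignment prologue, no checks:

#include <stdint.h>

/* at -O2, gcc and clang both emit essentially "movq $0, (%rdi); ret" */
void clear8(uint64_t *p)
{
	*p = 0;
}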

For the kernel, I$ misses are a very real thing, and the cost of
alignment is often the fact that you made things three times bigger
than they needed to be, and that it might make you uninline things.

Function calls are quite expensive - partly because of all the horrid
CPU bug workarounds (that are often *MORE* costly than any microcode
overhead of 'rep movs', which some people don't seem to have realized)
- but also because nonlinear code is simply a *big* hit when you don't
have good I$ behavior.

Benchmarks often don't show those effects. The benchmarks that have
big enough I$ footprints for those issues to show up are sadly not the
ones that compiler or library people use, which then results in fancy
memcpy routines that are actually really bad in real life.

That said, I at one point had a local patch that did a "memcpy_a4/a8"
for when the source and destination types were aligned and the right
size and also had the size as a hint (ie it called "memcpy_a8_large"
when types were 8-byte aligned and larger than 256 bytes iirc), so
that we could then do the right thing and avoid alignment and size
checks when we had enough of a clue that doing those checks was likely
a bad idea (note the "likely" - particularly for user copies the type
may be aligned, but user space might have placed it unaligned anyway).
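
Very roughly, the idea was something like the sketch below (this is a
sketch, not the actual patch - memcpy_a8() and memcpy_a8_large() are
the helper names from above, assumed to be implemented out of line
somewhere):

#include <stddef.h>
#include <string.h>

/* hypothetical out-of-line variants - assumed to be provided elsewhere */
void *memcpy_a8(void *dst, const void *src, size_t len);
void *memcpy_a8_large(void *dst, const void *src, size_t len);

/*
 * Sketch only: pick a variant purely from what the compiler can prove
 * about the call site.  Anything it can't prove falls back to plain
 * memcpy() with all its runtime checks.
 */
#define memcpy_hinted(dst, src, len)					\
({									\
	size_t __len = (len);						\
	void *__ret;							\
	if (__alignof__(*(dst)) >= 8 && __alignof__(*(src)) >= 8) {	\
		if (__builtin_constant_p(__len) && __len > 256)		\
			__ret = memcpy_a8_large((dst), (src), __len);	\
		else							\
			__ret = memcpy_a8((dst), (src), __len);		\
	} else {							\
		__ret = memcpy((dst), (src), __len);			\
	}								\
	__ret;								\
})

struct foo { long a, b, c; };

/* 8-byte aligned types and a constant size: resolves to memcpy_a8()
 * with no runtime alignment or size checks at the call site */
static inline void copy_foo(struct foo *d, const struct foo *s)
{
	memcpy_hinted(d, s, sizeof(*s));
}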

And honestly, that's what I'd like to see gcc generate natively: a
call instruction with the "rep movsb" semantics (so %rsi/%rdi/%rcx
arguments, and they get clobbered).

That way, we could rewrite it in place and just replace it with "rep
movsb" when we know the hardware is good at it (or - as mentioned -
when we know the hardware is particularly bad at function calls: the
retpoline crap really is horrendously expensive with forced branch
mispredicts).
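
The kernel can already hand-roll that contract with an alternative;
stripped way down (made-up fallback name, no exception handling, kernel
context assumed), it looks something like:

#include <asm/alternative.h>
#include <asm/cpufeatures.h>

/*
 * Register contract: dst in %rdi, src in %rsi, count in %rcx, all three
 * clobbered.  On FSRM hardware the call gets patched to a bare
 * "rep movsb"; elsewhere it stays a call to an asm helper that uses the
 * exact same registers and preserves everything else
 * ("memcpy_repmovs_fallback" is a made-up name for that helper).
 */
static inline void *memcpy_repmovs(void *dst, const void *src,
				   unsigned long len)
{
	void *ret = dst;

	asm volatile(ALTERNATIVE("call memcpy_repmovs_fallback",
				 "rep movsb", X86_FEATURE_FSRM)
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : : "memory");
	return ret;
}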

And then have a simple library routine (or a couple) for the other
cases. I'd like gcc to give alignment and size hints too,
even if I suspect that we'd just make all versions be aliases to the
generic case, because once it's a function call, the rest tends to be
in the noise.
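
IOW, if the hinted variants turn out not to matter, they can literally
all be aliases (sketch only, with a dumb byte loop standing in for the
real generic routine):

#include <stddef.h>

void *memcpy_generic(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* stand-in body - the real generic copy loop goes here */
	while (len--)
		*d++ = *s++;
	return dst;
}

/* the hinted entry points from above, all pointing at the same code */
void *memcpy_a8(void *dst, const void *src, size_t len)
	__attribute__((alias("memcpy_generic")));
void *memcpy_a8_large(void *dst, const void *src, size_t len)
	__attribute__((alias("memcpy_generic")));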

What gcc has now for memcpy/memset is complicated and largely useless.
I think it has been done for all the wrong reasons (ie SPEC-type
benchmarking where you optimize for a known target CPU, which is bogus
garbage).

               Linus
