linux-kernel - Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops

Message-ID: <CAHk-=wg_XHkR93Kx8dKrC7797nZECooAbwg0XjvjDeT1_jTohw@mail.gmail.com>
Date: Mon, 9 Jun 2025 12:25:08 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Uros Bizjak <ubizjak@...il.com>
Cc: David Laight <david.laight.linux@...il.com>, Peter Zijlstra <peterz@...radead.org>, 
	Mateusz Guzik <mjguzik@...il.com>, mingo@...hat.com, x86@...nel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops

On Mon, 9 Jun 2025 at 09:38, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> What gcc has now for memcpy/memset is complicated and largely useless.

Just to clarify: I'm talking about the "pick between rep movs and
library call" parts of the gcc code. That's the part that then ends up
being about very random choices that can't be done well statically
because the exact choices depend very much on microarchitecture.

What is absolutely *not* useless is when the compiler decides to just
do the memcpy entirely by hand using regular mov instructions.

That's the main reason we end up no longer having our own memcpy
inlines and helpers - that and the fact that structure assignments etc
mean that we can't catch 'memcpy()' in the general case anyway.

So the whole "I'm turning this small and known-size memcpy into just X
pairs of 'mov' instructions" is a big deal. That part I love.
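
As a rough illustration of that small, known-size case, here is a minimal
C sketch. The 'struct pair' type and both helper names are made up for
this example. With optimization enabled, compilers typically expand either
helper into a couple of plain loads and stores, but the exact output
depends on the compiler version and flags, so treat that as an
expectation rather than a guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte structure, used only for illustration. */
struct pair {
	uint64_t a;
	uint64_t b;
};

static_assert(sizeof(struct pair) == 16, "example assumes a 16-byte struct");

/*
 * The size is a compile-time constant, so an optimizing compiler is free
 * to open-code this copy with ordinary moves instead of emitting a
 * library call or a string instruction.
 */
static inline void copy_pair(struct pair *dst, const struct pair *src)
{
	memcpy(dst, src, sizeof(*dst));
}

/*
 * A structure assignment is the same copy with no memcpy() visible in
 * the source, which is why it cannot be caught by wrapping memcpy().
 */
static inline void assign_pair(struct pair *dst, const struct pair *src)
{
	*dst = *src;
}
```

Comparing the generated assembly for the two helpers is an easy way to
check what a given compiler actually does with them.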

It's the "call to library routine or use string instructions" that I
don't like, and that I think the kernel would be better off picking
dynamically at boot time with instruction rewriting.

But to do a good job at that, we'd need that memcpy call to have the
string instruction semantics (including, very much, the same clobber
rules).

And I do think we'd want to have hints as to size and alignment
because the whole "compiler knew about those, but then turned it into
a single special library call so that we can no longer optimize for
small/large/alignment cases" is sad.

So what I'd love to see is that if we have a

        large_struct_dest = large_struct_source;

then gcc would generate

        leaq dest,%rdi // or whatever
        leaq src,%rsi // again - obviously this will depend
        movl $size,%ecx
        call rep_movsb_large_aligned

so that we can take that target information into account when we rewrite it.

For example, on *some* microarchitectures, we'd decide to just always
replace all those calls with 'rep movsb', simply because the uarch is
known to be good at it.

But in *other* cases, we might only do it when we know the copy is
large (thus the need for a size hint).
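
The decision described above could be modeled in C roughly as follows.
This is only a sketch of the policy, not of any existing kernel code: the
'uarch_info' structure, its fields, the enum and 'pick_copy_impl()' are
all invented for illustration, and the actual patching of call sites is
left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes of rewriting one copy call site at boot. */
enum copy_impl {
	COPY_KEEP_CALL,		/* leave the out-of-line call in place */
	COPY_REP_MOVSB,		/* patch the call into 'rep movsb' */
};

/* Hypothetical per-microarchitecture knobs, filled in at boot. */
struct uarch_info {
	bool rep_movsb_always_good;	/* 'rep movsb' is fine at any size */
	bool rep_movsb_good_when_large;	/* only worth it for big copies */
	size_t large_threshold;		/* what "large" means on this uarch */
};

/* Pick an implementation for one call site from its size hint. */
static enum copy_impl pick_copy_impl(const struct uarch_info *u, size_t size)
{
	assert(u != NULL);

	if (u->rep_movsb_always_good)
		return COPY_REP_MOVSB;
	if (u->rep_movsb_good_when_large && size >= u->large_threshold)
		return COPY_REP_MOVSB;
	return COPY_KEEP_CALL;
}
```

The point of the size hint is visible in the second test: without it the
rewriter could only make one choice per microarchitecture, not one per
call site.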

And we might even be able to then turn that

        movl $size,%ecx
        call rep_movsb_large_aligned

pattern into

        movl $size/8,%ecx
        rep movsq

on older microarchitectures that do better at 'movsq' than at 'movsb', but
have slow function calls due to retpoline crap.
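
One detail implicit in that rewrite: 'size/8' only describes the same copy
when the size is a multiple of 8, which is presumably what the "aligned"
part of the hint would have to promise. A portable C model of the idea,
with invented helper names and a plain loop standing in for the real
'rep movsq' instruction, might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Portable stand-in for 'rep movsq': copy 'count' 8-byte words in
 * ascending address order. This models the semantics only; it is not
 * the instruction itself.
 */
static void model_rep_movsq(void *dst, const void *src, size_t count)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (count--) {
		uint64_t q;

		memcpy(&q, s, sizeof(q));
		memcpy(d, &q, sizeof(q));
		d += sizeof(q);
		s += sizeof(q);
	}
}

/*
 * Hypothetical rewrite step: turn a byte count into a quadword count.
 * Returns 1 if the copy was done as quadwords, 0 if the size is not a
 * multiple of 8 and the call site would have to be left alone.
 */
static int copy_as_rep_movsq(void *dst, const void *src, size_t size)
{
	if (size % 8 != 0)
		return 0;
	model_rep_movsq(dst, src, size / 8);
	return 1;
}
```

A size that is not a multiple of 8 is simply refused here; a real
rewriter would need either a guarantee from the compiler or a tail copy.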

Admittedly I don't think anybody has the energy to do those kinds of
bigger rewrites, but I think it would be good to have the _option_ if
somebody gets excited about it.

              Linus