Message-ID: <CAHk-=wg_XHkR93Kx8dKrC7797nZECooAbwg0XjvjDeT1_jTohw@mail.gmail.com>
Date: Mon, 9 Jun 2025 12:25:08 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Uros Bizjak <ubizjak@...il.com>
Cc: David Laight <david.laight.linux@...il.com>, Peter Zijlstra <peterz@...radead.org>,
Mateusz Guzik <mjguzik@...il.com>, mingo@...hat.com, x86@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
On Mon, 9 Jun 2025 at 09:38, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> What gcc has now for memcpy/memset is complicated and largely useless.
Just to clarify: I'm talking about the "pick between rep movs and
library call" parts of the gcc code. That's the part that then ends up
being about very random choices that can't be done well statically
because the exact choices depend very much on microarchitecture.
What is absolutely *not* useless is when the compiler decides to just
do the memcpy entirely by hand using regular mov instructions.
That's the main reason we end up no longer having our own memcpy
inlines and helpers - that and the fact that structure assignments etc
mean that we can't catch 'memcpy()' in the general case anyway.
So the whole "I'm turning this small and known-size memcpy into just X
pairs of 'mov' instructions" is a big deal. That part I love.
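As a rough sketch of the case being praised here, in C (the struct and the two helper names are made up for illustration, and what the compiler actually emits depends on target and optimization level):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte structure, used only for illustration. */
struct pair {
    uint64_t a;
    uint64_t b;
};

/*
 * The size is a compile-time constant (16 bytes), so an optimizing
 * compiler is free to expand this memcpy() inline, typically as a few
 * plain loads and stores, instead of emitting a call or a string
 * instruction.
 */
static inline void copy_pair(struct pair *dst, const struct pair *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/*
 * A structure assignment is an implicit copy of the same size. There is
 * no memcpy() in the source here for a macro or wrapper to intercept.
 */
static inline void assign_pair(struct pair *dst, const struct pair *src)
{
    *dst = *src;
}
```

Both helpers copy the same 16 bytes; only the first one is visible as a 'memcpy()' at the source level.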
It's the "call to library routine or use string instructions" that I
don't like, and that I think the kernel would be better off picking
dynamically at boot time with instruction rewriting.
But to do a good job at that, we'd need that memcpy call to have the
string instruction semantics (including, very much, same clobber
rules).
And I do think we'd want to have hints as to size and alignment
because the whole "compiler knew about those, but then turned it into
a single special library call so that we can no longer optimize for
small/large/alignment cases" is sad.
So what I'd love to see is that if we have a
    large_struct_dest = large_struct_source;
then gcc would generate
    leaq dest,%rdi    // or whatever
    leaq src,%rsi     // again - obviously this will depend
    movl $size,%ecx
    call rep_movsb_large_aligned
so that we can take that target information into account when we rewrite it.
For example, on *some* microarchitectures, we'd decide to just always
replace all those calls with 'rep movsb', simply because the uarch is
known to be good at it.
But in *other* cases, we might only do it when we know the copy is
large (thus the need for a size hint).
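A minimal sketch of that per-call-site decision, in C. Everything here is hypothetical: the type names, the 'fast_rep_movsb' flag and the size threshold are assumptions for illustration, not existing kernel interfaces, and real patching would rewrite instruction bytes rather than return an enum:

```c
#include <stdbool.h>
#include <stddef.h>

/* What a call site could be rewritten into (illustrative only). */
enum copy_impl {
    COPY_KEEP_CALL,   /* leave the call to the out-of-line helper */
    COPY_REP_MOVSB    /* replace the call with an inline 'rep movsb' */
};

/* Hypothetical boot-time description of the running microarchitecture. */
struct uarch_info {
    bool fast_rep_movsb;        /* uarch is known to be good at 'rep movsb' */
    size_t rep_movsb_min_size;  /* assumed size above which it still pays off */
};

/* Hypothetical hints the compiler would have left at the call site. */
struct copy_site {
    size_t size;    /* size hint: number of bytes copied */
    bool aligned;   /* alignment hint, unused in this simple policy */
};

enum copy_impl pick_copy_impl(const struct uarch_info *u,
                              const struct copy_site *site)
{
    /* Some uarchs: always use the string instruction. */
    if (u->fast_rep_movsb)
        return COPY_REP_MOVSB;

    /* Others: only when the copy is known to be large. */
    if (site->size >= u->rep_movsb_min_size)
        return COPY_REP_MOVSB;

    return COPY_KEEP_CALL;
}
```

Without the size hint the second test could not be made at all, which is the point of encoding it in the call.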
And we might even be able to then turn that
    movl $size,%ecx
    call rep_movsb_large_aligned
pattern into
    movl $size/8,%ecx
    rep movsq
on older architectures that do better at 'movsq' than at 'movsb', but
have slow function calls due to retpoline crap.
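One detail worth spelling out: 'rep movsq' copies 8 bytes per count, so the $size/8 rewrite only preserves the copy when the byte count is a multiple of 8. A small C sketch of that check, assuming a hypothetical helper name and that a patcher would simply skip sites where it fails:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a byte count meant for 'rep movsb' into a
 * quadword count for 'rep movsq'. Returns false when the size is not a
 * multiple of 8, in which case the two sequences would not copy the same
 * number of bytes and the site should be left alone.
 */
bool movsq_count(uint32_t size_bytes, uint32_t *qwords)
{
    if (size_bytes % 8 != 0)
        return false;

    *qwords = size_bytes / 8;
    return true;
}
```

For a 64-byte copy this yields a count of 8; for a 13-byte copy it refuses the conversion.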
Admittedly I don't think anybody has the energy to do those kinds of
bigger rewrites, but I think it would be good to have the _option_ if
somebody gets excited about it.
Linus