Message-ID: <CAHk-=wjYrfUDOd3VYhn8HvH7MCnampXt1TdtaXo86UDrT_rbMQ@mail.gmail.com>
Date: Wed, 2 Apr 2025 11:56:31 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Andrew Cooper <andrew.cooper3@...rix.com>
Cc: mjguzik@...il.com, linux-kernel@...r.kernel.org, mingo@...hat.com,
x86@...nel.org, "Peter Zijlstra (Intel)" <peterz@...radead.org>
Subject: Re: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for
inlined ops
On Wed, 2 Apr 2025 at 11:40, Andrew Cooper <andrew.cooper3@...rix.com> wrote:
>
> You still want the compiler to be able to do a first-pass optimisation
> over __builtin_mem*(), for elimination/merging/etc
Absolutely.
That's literally why the kernel ends up using __builtin_memcpy() in
the first place - so that the compiler can turn the (very common)
small fixed-size cases into just regular moves.
The function call would only be for actual big moves and/or unknown sizes.
That said, it can be advantageous to have more than one out-of-line function to call.
For the user copy case (where we control the horizontal and the
vertical in the kernel) at one point I did have a patch that did
something fancier than just a single call-out.
Because even when the compiler doesn't know the exact length, it's
often known that it's a multiple of four, for example. It's not all
that uncommon to see code like this:
        memcpy(image->segment, segments, nr_segments * sizeof(*segments));
and an even more common case is "I'm doing an out-of-line call not
because the size is unknown, but because the size is big".
So I encoded the alignment of the size in the function name, so
that I could have different versions for the most obvious
straightforward cases (notably the "just do word-at-a-time copies" case).
Admittedly my main use case for that was the big fixed-size case in
the stat code, which just does

        return copy_to_user(statbuf, &tmp, sizeof(tmp)) ? -EFAULT : 0;

and sticks out like a sore thumb on some benchmarks.
But I've always hated the patch because it turns out that the real fix
is to get rid of the temporary buffer on the stack, and just copy the
fields one-by-one with 'unsafe_put_user()' and friends, but every time
I do that - and I've done it several times - I end up throwing the
patch away because it ends up being problematic on non-x86
architectures
(Reason: INIT_STRUCT_STAT64_PADDING(). Ugh).
Linus