Message-ID: <20250608215127.3b41ac1d@pumpkin>
Date: Sun, 8 Jun 2025 21:51:27 +0100
From: David Laight <david.laight.linux@...il.com>
To: Uros Bizjak <ubizjak@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Mateusz Guzik
<mjguzik@...il.com>, torvalds@...ux-foundation.org, mingo@...hat.com,
x86@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for
inlined ops
On Fri, 6 Jun 2025 09:27:07 +0200
Uros Bizjak <ubizjak@...il.com> wrote:
> On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> > > gcc is over-eager to use rep movsq/stosq (starts above 40 bytes), which
> > > comes with a significant penalty on CPUs without the respective fast
> > > short ops bits (FSRM/FSRS).
> >
> > I don't suppose there's a magic compiler toggle to make it emit prefix
> > padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> > right?
> >
> > Something like:
> >
> > 2e 2e 2e f3 a4 cs cs rep movsb %ds:(%rsi),%es:(%rdi)
> >
> > because if we can get the compilers to do this; then I can get objtool
> > to collect all these locations and then we can runtime patch them to be:
> >
> > call rep_movs_alternative / rep_stos_alternative
> >
> > or whatever other crap we want really.
>
> BTW: You can achieve the same effect by using -mstringop-strategy=libcall
>
> Please consider the following testcase:
>
> --cut here--
> struct a { int r[40]; };
> struct a foo (struct a b) { return b; }
> --cut here--
>
> By default, the compiler emits SSE copy (-O2):
>
> foo:
> .LFB0:
> .cfi_startproc
> movdqu 8(%rsp), %xmm0
> movq %rdi, %rax
> movups %xmm0, (%rdi)
> movdqu 24(%rsp), %xmm0
> movups %xmm0, 16(%rdi)
> ...
> movdqu 152(%rsp), %xmm0
> movups %xmm0, 144(%rdi)
> ret
>
> but the kernel doesn't enable SSE, so the compiler falls back to (-O2 -mno-sse):
>
> foo:
> movq 8(%rsp), %rdx
> movq %rdi, %rax
> leaq 8(%rdi), %rdi
> leaq 8(%rsp), %rsi
> movq %rax, %rcx
> movq %rdx, -8(%rdi)
> movq 160(%rsp), %rdx
> movq %rdx, 144(%rdi)
> andq $-8, %rdi
> subq %rdi, %rcx
> subq %rcx, %rsi
> addl $160, %ecx
> shrl $3, %ecx
> rep movsq
> ret
>
> Please note the code that aligns pointers before "rep movsq".
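FWIW, a quick way to see what the libcall strategy does with the same
testcase (an expectation only - I haven't captured this output, and test.c
is just the file name I'm assuming for the snippet above):

	gcc -O2 -mno-sse -mstringop-strategy=libcall -S test.c

should collapse the whole copy in foo() to a single "call memcpy", with
none of the inline pointer-alignment code.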
Do you ever want that alignment code?
From what I remember of benchmarking 'rep movsb', even on Ivy Bridge the
alignment makes almost no difference to throughput.
I don't have any old Zen cpu to test, though.
On Zen 5 pretty much the only thing that matters is cache-line aligning
the destination buffer - but there are some strange oddities.
I need to revisit my 'rep movsb' benchmarks though.
If you make %cx depend on the initial timestamp (cx = cx + (timestamp & zero)
will do it), and then make the final timestamp depend on a result of the copy
(easiest if using the performance counters), you should get a pretty true
value for the setup cost (pretty much impossible if you try to synchronise
with lfence or mfence).
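Something like the following is what I have in mind - an untested sketch,
assuming performance counter 0 has already been programmed to count core
cycles (e.g. via perf_event_open()) and user-space rdpmc is enabled; the
names (time_rep_movsb, the run-time 'zero' argument) are only illustrative:

--cut here--
#include <stdint.h>
#include <stddef.h>

/* Read performance-monitoring counter 'counter'.
 * Assumes user-space rdpmc is enabled and the counter is set up to
 * count core cycles. */
static inline uint64_t rdpmc(unsigned int counter)
{
	uint32_t lo, hi;

	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
	return ((uint64_t)hi << 32) | lo;
}

/* 'zero' must be 0 at run time but must not be a compile-time constant
 * (read it from argv or similar), so the '& zero' terms survive as real
 * data dependencies.  'len' is assumed non-zero. */
static uint64_t time_rep_movsb(void *dst, const void *src, size_t len,
			       uint64_t zero)
{
	unsigned int counter = 0;
	void *d = dst;
	const void *s = src;
	size_t n = len;
	uint64_t start, end;

	start = rdpmc(counter);

	/* cx = cx + (timestamp & zero): adds nothing, but the copy
	 * length now depends on the first counter read, so the
	 * 'rep movsb' cannot start before it. */
	n += (size_t)(start & zero);

	asm volatile("rep movsb"
		     : "+D" (d), "+S" (s), "+c" (n)
		     : : "memory");

	/* Make the counter number depend on the last byte the copy
	 * wrote, so the second read cannot complete before the copy
	 * has produced its result. */
	counter += (unsigned int)(((volatile uint8_t *)dst)[len - 1] & zero);

	end = rdpmc(counter);

	return end - start;
}
--cut here--

Unlike an lfence/mfence pair, the dependency chain only orders the two
counter reads against the copy itself, so the measured value shouldn't be
inflated by the cost of draining the whole pipeline.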
David