Message-ID: <20250608215127.3b41ac1d@pumpkin>
Date: Sun, 8 Jun 2025 21:51:27 +0100
From: David Laight <david.laight.linux@...il.com>
To: Uros Bizjak <ubizjak@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Mateusz Guzik
<mjguzik@...il.com>, torvalds@...ux-foundation.org, mingo@...hat.com,
x86@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for
inlined ops
On Fri, 6 Jun 2025 09:27:07 +0200
Uros Bizjak <ubizjak@...il.com> wrote:
> On Thu, Jun 5, 2025 at 9:00 PM Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > On Thu, Jun 05, 2025 at 06:47:33PM +0200, Mateusz Guzik wrote:
> > > gcc is over-eager to use rep movsq/stosq (starts above 40 bytes), which
> > > comes with a significant penalty on CPUs without the respective fast
> > > short ops bits (FSRM/FSRS).
> >
> > I don't suppose there's a magic compiler toggle to make it emit prefix
> > padded 'rep movs'/'rep stos' variants such that they are 5 bytes each,
> > right?
> >
> > Something like:
> >
> > 2e 2e 2e f3 a4 cs cs rep movsb %ds:(%rsi),%es:(%rdi)
> >
> > because if we can get the compilers to do this; then I can get objtool
> > to collect all these locations and then we can runtime patch them to be:
> >
> > call rep_movs_alternative / rep_stos_alternative
> >
> > or whatever other crap we want really.
>
> BTW: You can achieve the same effect by using -mstringop-strategy=libcall
>
> Please consider the following testcase:
>
> --cut here--
> struct a { int r[40]; };
> struct a foo (struct a b) { return b; }
> --cut here--
>
> By default, the compiler emits SSE copy (-O2):
>
> foo:
> .LFB0:
> .cfi_startproc
> movdqu 8(%rsp), %xmm0
> movq %rdi, %rax
> movups %xmm0, (%rdi)
> movdqu 24(%rsp), %xmm0
> movups %xmm0, 16(%rdi)
> ...
> movdqu 152(%rsp), %xmm0
> movups %xmm0, 144(%rdi)
> ret
>
> but the kernel doesn't enable SSE, so the compiler falls back to (-O2 -mno-sse):
>
> foo:
> movq 8(%rsp), %rdx
> movq %rdi, %rax
> leaq 8(%rdi), %rdi
> leaq 8(%rsp), %rsi
> movq %rax, %rcx
> movq %rdx, -8(%rdi)
> movq 160(%rsp), %rdx
> movq %rdx, 144(%rdi)
> andq $-8, %rdi
> subq %rdi, %rcx
> subq %rcx, %rsi
> addl $160, %ecx
> shrl $3, %ecx
> rep movsq
> ret
>
> Please note the code that aligns pointers before "rep movsq".
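FWIW, a quick way to see what the libcall strategy does with the same
testcase (an expectation only - I haven't captured this output, and test.c
is just the file name I'm assuming for the snippet above):

	gcc -O2 -mno-sse -mstringop-strategy=libcall -S test.c

should collapse the whole copy in foo() to a single "call memcpy", with
none of the inline pointer-alignment code.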
Do you ever want that alignment code?
From what I remember of benchmarking 'rep movsb', even on Ivy Bridge the
alignment makes almost no difference to throughput.
I don't have any old Zen cpu to test, though.
On Zen 5 pretty much the only thing that matters is cache-line aligning
the destination buffer - but there are some strange oddities.
I need to revisit my 'rep movsb' benchmarks though.
If you make %cx depend on the initial timestamp (cx = cx + (timestamp & zero)
will do it), and then make the final timestamp depend on a result of the copy
(easiest if using the performance counters), you should get a pretty true
value for the setup cost (pretty much impossible if you try to synchronise
with lfence or mfence).
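Something like the following is what I have in mind - an untested sketch,
assuming performance counter 0 has already been programmed to count core
cycles (e.g. via perf_event_open()) and user-space rdpmc is enabled; the
names (time_rep_movsb, the run-time 'zero' argument) are only illustrative:

--cut here--
#include <stdint.h>
#include <stddef.h>

/* Read performance-monitoring counter 'counter'.
 * Assumes user-space rdpmc is enabled and the counter is set up to
 * count core cycles. */
static inline uint64_t rdpmc(unsigned int counter)
{
	uint32_t lo, hi;

	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
	return ((uint64_t)hi << 32) | lo;
}

/* 'zero' must be 0 at run time but must not be a compile-time constant
 * (read it from argv or similar), so the '& zero' terms survive as real
 * data dependencies.  'len' is assumed non-zero. */
static uint64_t time_rep_movsb(void *dst, const void *src, size_t len,
			       uint64_t zero)
{
	unsigned int counter = 0;
	void *d = dst;
	const void *s = src;
	size_t n = len;
	uint64_t start, end;

	start = rdpmc(counter);

	/* cx = cx + (timestamp & zero): adds nothing, but the copy
	 * length now depends on the first counter read, so the
	 * 'rep movsb' cannot start before it. */
	n += (size_t)(start & zero);

	asm volatile("rep movsb"
		     : "+D" (d), "+S" (s), "+c" (n)
		     : : "memory");

	/* Make the counter number depend on the last byte the copy
	 * wrote, so the second read cannot complete before the copy
	 * has produced its result. */
	counter += (unsigned int)(((volatile uint8_t *)dst)[len - 1] & zero);

	end = rdpmc(counter);

	return end - start;
}
--cut here--

Unlike an lfence/mfence pair, the dependency chain only orders the two
counter reads against the copy itself, so the measured value shouldn't be
inflated by the cost of draining the whole pipeline.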
David