Message-ID: <xmzxiwno5q3ordgia55wyqtjqbefxpami5wevwltcto52fehbv@ul44rsesp4kw>
Date: Wed, 2 Apr 2025 15:42:40 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: torvalds@...ux-foundation.org, mingo@...hat.com
Cc: x86@...nel.org, linux-kernel@...r.kernel.org
Subject: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for
 inlined ops

Not a real submission yet, as I would like results from other people.

tl;dr: when benchmarking compilation of a hello-world program I'm getting
a 1.7% increase in throughput on Sapphire Rapids by convincing the
compiler to use only regular stores for inlined memset and memcpy.

Note this uarch does have FSRM and still benefits from not using
rep-prefixed ops in some cases.

I am not in a position to bench this on other CPUs; it would be nice if
someone did it on AMD.

Onto the business:
The kernel is chock full of inlined rep movsq and rep stosq, including
in hot paths, and these are known to be detrimental to performance below
certain sizes.
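
For illustration, the kind of construct that ends up inlined this way (a
standalone sketch, not code from the kernel tree) is a copy of a fixed-size,
pt_regs-sized struct, which gcc at -O2 may lower to rep movsq depending on
the selected memcpy strategy:

/*
 * Standalone illustration, not kernel code: a fixed-size copy that gcc
 * may turn into an inlined rep movsq, depending on target tuning and
 * -mmemcpy-strategy.
 */
#include <string.h>

struct regs_like {
	unsigned long r[21];	/* 21 quadwords = 168 bytes, pt_regs-sized */
};

void copy_regs(struct regs_like *dst, const struct regs_like *src)
{
	/* The size is a compile-time constant, so the copy gets inlined. */
	memcpy(dst, src, sizeof(*dst));
}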

Most notably in sync_regs:
<+0>:     endbr64
<+4>:     mov    %gs:0x22ca5d4(%rip),%rax        # 0xffffffff8450f010 <cpu_current_top_of_stack>
<+12>:    mov    %rdi,%rsi
<+15>:    sub    $0xa8,%rax
<+21>:    cmp    %rdi,%rax
<+24>:    je     0xffffffff82244a55 <sync_regs+37>
<+26>:    mov    $0x15,%ecx
<+31>:    mov    %rax,%rdi
<+34>:    rep movsq %ds:(%rsi),%es:(%rdi)
<+37>:    jmp    0xffffffff82256ba0 <__x86_return_thunk>

When issuing hello-world compiles in a loop, this routine accounts for over
1% of total CPU time as reported by perf. With the kernel recompiled to
instead do the copy with regular stores, this drops to 0.13%.

Recompiled, it looks like this:
<+0>:     endbr64
<+4>:     mov    %gs:0x22b9f44(%rip),%rax        # 0xffffffff8450f010 <cpu_current_top_of_stack>
<+12>:    sub    $0xa8,%rax
<+18>:    cmp    %rdi,%rax
<+21>:    je     0xffffffff82255114 <sync_regs+84>
<+23>:    xor    %ecx,%ecx
<+25>:    mov    %ecx,%edx
<+27>:    add    $0x20,%ecx
<+30>:    mov    (%rdi,%rdx,1),%r10
<+34>:    mov    0x8(%rdi,%rdx,1),%r9
<+39>:    mov    0x10(%rdi,%rdx,1),%r8
<+44>:    mov    0x18(%rdi,%rdx,1),%rsi
<+49>:    mov    %r10,(%rax,%rdx,1)
<+53>:    mov    %r9,0x8(%rax,%rdx,1)
<+58>:    mov    %r8,0x10(%rax,%rdx,1)
<+63>:    mov    %rsi,0x18(%rax,%rdx,1)
<+68>:    cmp    $0xa0,%ecx
<+74>:    jb     0xffffffff822550d9 <sync_regs+25>
<+76>:    mov    (%rdi,%rcx,1),%rdx
<+80>:    mov    %rdx,(%rax,%rcx,1)
<+84>:    jmp    0xffffffff822673e0 <__x86_return_thunk>

bloat-o-meter says:
Total: Before=30021301, After=30089151, chg +0.23%

There are of course other spots which are modified, and they also see a
reduction in time spent.

Bench results are reported as compilations completed in a 10-second period,
with /tmp backed by tmpfs:

before:
978 ops (97 ops/s)
979 ops (97 ops/s)
978 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)

after:
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
996 ops (99 ops/s)

I'm running this with a Debian 12 userspace (gcc 12.2.0).

I asked the LKP folks to bench this but have not received a response yet:
https://lore.kernel.org/oe-lkp/CAGudoHHd8TkyA1kOQ2KtZdZJ2VxUW=2mP-JR0t_oR07TfrwN8w@mail.gmail.com/

Repro instructions:
for i in $(seq 1 10); do taskset --cpu-list 1 ./ccbench 10; done

taskset is important, as otherwise processes roam around the box big
time.

Attached files are:
- cc.c for will-it-scale if someone wants to profile the thing while it
  loops indefinitely
- src0.c -- hello world for reference, plop into /src/src0.c
- ccbench.c is the bench; compile with cc -O2 -o ccbench ccbench.c

It spawns gcc through system(), forcing it to go through the shell, which
mimics what happens when compiling with make.
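
Roughly speaking the bench loop amounts to the following (a sketch of the
assumed shape, not the attached ccbench.c verbatim):

/*
 * Hypothetical sketch, not the attached ccbench.c: spawn the compiler via
 * system() in a loop for the requested number of seconds and count
 * completed compiles.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
	int secs = argc > 1 ? atoi(argv[1]) : 10;
	time_t end = time(NULL) + secs;
	unsigned long ops = 0;

	while (time(NULL) < end) {
		if (system("cc -o /tmp/src0 /src/src0.c") != 0)
			return 1;
		ops++;
	}
	printf("%lu ops (%lu ops/s)\n", ops, ops / secs);
	return 0;
}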

 arch/x86/Makefile | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 9b76e77ff7f7..1a1afcc3041f 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -198,6 +198,29 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
     endif
 endif
 
+ifdef CONFIG_CC_IS_GCC
+#
+# Inline memcpy and memset handling policy for gcc.
+#
+# For ops with sizes known at compilation time, gcc quickly resorts to issuing
+# rep movsq and stosq. On most uarchs rep-prefixed ops have a significant
+# startup latency and it is faster to issue regular stores (even if in loops)
+# to handle small buffers.
+#
+# This of course comes at an expense in terms of i-cache footprint; bloat-o-meter
+# reported a 0.23% increase for enabling these.
+#
+# We inline up to 256 bytes, which in the best case issues a few movs and in
+# the worst case creates a 4 * 8 store loop.
+#
+# The upper limit was chosen semi-arbitrarily -- uarchs differ wildly in the
+# threshold past which a rep-prefixed op becomes faster, with 256 being the
+# lowest common denominator. Someone(tm) should revisit this from time to time.
+#
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+endif
+
 #
 # If the function graph tracer is used with mcount instead of fentry,
 # '-maccumulate-outgoing-args' is needed to prevent a GCC bug
-- 
2.43.0

