Message-ID: <20250605164733.737543-1-mjguzik@gmail.com>
Date: Thu, 5 Jun 2025 18:47:33 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: torvalds@...ux-foundation.org
Cc: mingo@...hat.com,
x86@...nel.org,
linux-kernel@...r.kernel.org,
Mateusz Guzik <mjguzik@...il.com>
Subject: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
gcc is overeager to use rep movsq/stosq for inlined ops (it starts above 40
bytes), which comes with a significant penalty on CPUs lacking the respective
fast short ops bits (FSRM/FSRS).
On top of that, even uarchs with FSRM don't necessarily have FSRS (Ice Lake
and Sapphire Rapids don't), and more importantly rep movsq is not fast even
when FSRM is present.
The issue was reported to upstream gcc, but no progress has been made and it
looks like nothing will happen for the foreseeable future (see Links 1-3).
In the meantime perf is left on the table. Here is a sample result from
compiling a hello world program in a loop (in compilations/s):
Sapphire Rapids:
before: 979
after: 997 (+1.8%)
AMD EPYC 9R14:
before: 808
after: 815 (+0.8%)
So this is very much visible outside of a microbenchmark setting.
The workload is very page fault heavy, and a large part of that lands in sync_regs():
<+0>: endbr64
<+4>: mov %gs:0x22ca5d4(%rip),%rax # 0xffffffff8450f010 <cpu_current_top_of_stack>
<+12>: mov %rdi,%rsi
<+15>: sub $0xa8,%rax
<+21>: cmp %rdi,%rax
<+24>: je 0xffffffff82244a55 <sync_regs+37>
<+26>: mov $0x15,%ecx
<+31>: mov %rax,%rdi
<+34>: rep movsq %ds:(%rsi),%es:(%rdi)
<+37>: jmp 0xffffffff82256ba0 <__x86_return_thunk>
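For illustration, here is a minimal userspace sketch of the same pattern (not
from the kernel tree; the struct and function names are made up) which gcc
compiles to the rep movsq sequence above on the affected configurations:

#include <string.h>

/* 21 * 8 == 168 bytes, the same size as the struct pt_regs copy above */
struct regs_like {
        unsigned long r[21];
};

/* the size is known at compile time, so gcc inlines the copy */
void copy_regs(struct regs_like *dst, const struct regs_like *src)
{
        memcpy(dst, src, sizeof(*dst));
}

Compiled with gcc -O2 -S this tends to produce the same mov $0x15,%ecx; rep
movsq pair, while adding the -mmemcpy-strategy flag from the patch turns it
into a plain store loop (the exact output depends on the gcc version and
-march/-mtune in use).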
When microbenchmarking page faults, perf top shows:
before:
22.07% [kernel] [k] asm_exc_page_fault
12.83% pf_processes [.] testcase
11.81% [kernel] [k] sync_regs
after:
26.06% [kernel] [k] asm_exc_page_fault
13.18% pf_processes [.] testcase
[..]
0.91% [kernel] [k] sync_regs
A massive reduction in execution time of the routine.
Link 1: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
Link 2: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703
Link 3: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119704
Link 4: https://lore.kernel.org/oe-lkp/202504181042.54ea2b8a-lkp@intel.com/
Signed-off-by: Mateusz Guzik <mjguzik@...il.com>
---
v2:
- only do it if not building with CONFIG_X86_NATIVE_CPU
Hi Linus,
An RFC for this patch was posted here:
https://lore.kernel.org/all/xmzxiwno5q3ordgia55wyqtjqbefxpami5wevwltcto52fehbv@ul44rsesp4kw/
You rejected it on two grounds:
- this should be handled by gcc itself -- agreed, but based on the
interaction in the bugzillas I filed I don't believe this will happen any
time soon (if ever, to be frank)
- messing with local optimization flags -- perhaps ifdefing on
CONFIG_X86_NATIVE_CPU is good enough? If not, the change can be hidden
behind a config option (default Y) so interested parties can whack it
See the commit message for perf numbers. It would be a shame not to get
these wins just because gcc is being stubborn.
While I completely understand not liking compiler-specific hacks, I
believe I made a good enough case for rolling with them here.
That said, if you don't see any justification to get something of this
sort in, I'm dropping the matter.
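For reference, the effect of the strategy flags added below can be eyeballed
with a trivial userspace test (a sketch, not part of the patch; the names are
made up):

#include <string.h>

/* 21 * 8 == 168 bytes, well above gcc's ~40 byte rep threshold */
struct buf168 {
        unsigned long q[21];
};

/* fixed-size zeroing: a candidate for an inlined rep stosq by default */
void clear_buf(struct buf168 *p)
{
        memset(p, 0, sizeof(*p));
}

Building this with gcc -O2 -S, once with and once without
-mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign, should show
rep stosq being replaced with regular stores (subject to the gcc version and
tuning in use).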
cheers
arch/x86/Makefile | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 1913d342969b..9eb75bd7c81d 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -198,6 +198,31 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
endif
endif
+ifdef CONFIG_CC_IS_GCC
+ifndef CONFIG_X86_NATIVE_CPU
+#
+# Inline memcpy and memset handling policy for gcc.
+#
+# For ops with sizes known at compilation time gcc quickly resorts to issuing
+# rep movsq and stosq. On most uarchs rep-prefixed ops have a significant
+# startup latency and it is faster to issue regular stores (even if in loops)
+# to handle small buffers.
+#
+# This of course comes at the expense of i-cache footprint. bloat-o-meter
+# reported a 0.23% size increase for enabling these.
+#
+# We inline up to 256 bytes, which in the best case issues a few movs and in
+# the worst case creates a 4 * 8 store loop.
+#
+# The upper limit was chosen semi-arbitrarily as uarchs wildly differ in the
+# threshold past which rep-prefixed ops become faster, 256 being the lowest
+# common denominator. This should be fixed in the compiler.
+#
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+endif
+endif
+
#
# If the function graph tracer is used with mcount instead of fentry,
# '-maccumulate-outgoing-args' is needed to prevent a GCC bug
--
2.48.1