Message-ID: <mwwusvl7jllmck64xczeka42lglmsh7mlthuvmmqlmi5stp3na@raiwozh466wz>
Date: Thu, 27 Nov 2025 07:55:27 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: x86@...nel.org
Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, linux-kernel@...r.kernel.org,
torvalds@...ux-foundation.org, olichtne@...hat.com, atomasov@...hat.com, aokuliar@...hat.com
Subject: performance anomaly in rep movsq/movsb as seen on Sapphire Rapids
executing sync_regs()

Sapphire Rapids has both ERMS (of course) and FSRM.

sync_regs() runs into a corner case where both rep movsq and rep movsb
suffer a massive penalty when used to copy 168 bytes, a penalty which
clears itself when the data is copied with a bunch of movq instead.

I verified the issue is not present on AMD EPYC 9454; I don't know about
other Intel CPUs.
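
To make it concrete, the two rep-based variants being compared boil down
to something like the following (illustrative GNU C wrappers for a
168-byte copy, not the kernel's actual code paths -- the kernel goes
through __memcpy/__pi_memcpy):

static inline void copy168_movsq(void *dst, const void *src)
{
	long cnt = 168 / 8;	/* 21 quadwords */

	asm volatile("rep movsq"
		     : "+c" (cnt), "+D" (dst), "+S" (src)
		     : : "memory");
}

static inline void copy168_movsb(void *dst, const void *src)
{
	long cnt = 168;

	asm volatile("rep movsb"
		     : "+c" (cnt), "+D" (dst), "+S" (src)
		     : : "memory");
}
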
Details:

When benchmarking page faults (page_fault1 from will-it-scale),
sync_regs() is very high in the profile, performing a 168-byte copy with
rep movsq.

I figured movsq still sucks on the uarch, so I patched the kernel to use
movsb instead, but performance barely budged.

However, forcing the thing to do the copy with regular stores in
memcpy_orig (32 bytes per loop iteration + an 8-byte tail) unclogs it.
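
For 168 bytes that path works out to 5 iterations of 32 bytes plus a
single 8-byte tail. A rough C rendition of what the memcpy_orig route
ends up doing here (sketch only; the real thing is hand-written asm in
arch/x86/lib/memcpy_64.S):

static void copy168_unrolled(void *dst, const void *src)
{
	unsigned long *d = dst;
	const unsigned long *s = src;
	int i;

	/* 5 x 32 bytes = 160 bytes, four quadword loads/stores per iteration */
	for (i = 0; i < 5; i++) {
		d[0] = s[0];
		d[1] = s[1];
		d[2] = s[2];
		d[3] = s[3];
		d += 4;
		s += 4;
	}

	/* 8-byte tail to get to 168 */
	d[0] = s[0];
}
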
Check this out (ops/s):
rep movsb in ___pi_memcpy:
min:1293689 max:1293689 total:1293689
min:1293969 max:1293969 total:1293969
min:1293845 max:1293845 total:1293845
min:1293436 max:1293436 total:1293436

hand-rolled mov loop in memcpy_orig:
min:1498050 max:1498050 total:1498050
min:1499041 max:1499041 total:1499041
min:1498283 max:1498283 total:1498283
min:1499701 max:1499701 total:1499701
... or just shy of 16% faster.

I patched the kernel with a tunable, togglable at runtime, to select
which memcpy variant sync_regs() uses. Results reliably flip as I toggle
it.

perf top says:
rep movsb in ___pi_memcpy:
25.20% [kernel] [k] asm_exc_page_fault
14.60% [kernel] [k] __pi_memcpy
11.78% page_fault1_processes [.] testcase
4.71% [kernel] [k] _raw_spin_lock
2.36% [kernel] [k] __handle_mm_fault
2.00% [kernel] [k] clear_page_erms

hand-rolled mov loop in memcpy_orig:
27.99% [kernel] [k] asm_exc_page_fault
13.42% page_fault1_processes [.] testcase
5.46% [kernel] [k] _raw_spin_lock
2.72% [kernel] [k] __handle_mm_fault
2.48% [kernel] [k] clear_page_erms
[..]
0.59% [kernel] [k] memcpy_orig
0.04% [kernel] [k] __pi_memcpy

As you can see, the difference is staggering and this has to be a
deficiency at least in this uarch.

When it comes to sync_regs() specifically, I think it makes some sense
to instead recode it in asm and perhaps issue the movs "by hand", which
would work around the immediate problem and shave off a function call
per page fault.
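
Roughly what I have in mind, as an untested sketch: a fixed-count copy
the compiler can expand into plain moves. The caveat is the compiler may
turn the loop back into a memcpy call, in which case doing the movs in
the entry asm proper is the reliable variant:

asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
{
	struct pt_regs *regs = (struct pt_regs *)current_top_of_stack() - 1;

	if (regs != eregs) {
		unsigned long *dst = (unsigned long *)regs;
		const unsigned long *src = (const unsigned long *)eregs;
		unsigned int i;

		/* sizeof(struct pt_regs) == 168, i.e. 21 quadwords */
		for (i = 0; i < sizeof(struct pt_regs) / sizeof(long); i++)
			dst[i] = src[i];
	}
	return regs;
}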

However, per the profile results above, there is at least one case where
rep movsb-based memcpy can grossly underperform, and someone(tm) should
investigate what's going on there. Also note the kernel inlines plain
rep movsb for copy to/from user if FSRM is present, which is again
possibly susceptible to whatever the problem is.
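
For reference, with FSRM that inline boils down to something equivalent
to the following (illustrative sketch, ignoring STAC/CLAC and fault
handling; the real thing is copy_user_generic() in
arch/x86/include/asm/uaccess_64.h in recent kernels):

static inline unsigned long fsrm_user_copy(void *to, const void *from, unsigned long len)
{
	/* with X86_FEATURE_FSRM the ALTERNATIVE boils down to a bare rep movsb */
	asm volatile("rep movsb"
		     : "+c" (len), "+D" (to), "+S" (from)
		     : : "memory");
	return len;	/* bytes left uncopied, 0 on success */
}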

Maybe this is a matter of misalignment of the target or some other
bullshit; I have not tested it and I don't have the time to dig into it.
I would expect someone better clued in on this area to figure it out in
less time than I would need, hence I'm throwing this out there.
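
If someone wants to poke at the alignment angle, a userspace sweep along
these lines would be a cheap starting point (sketch only; it may well
fail to reproduce the effect outside of the page fault path, where the
destination is the task stack):

#include <stdio.h>
#include <time.h>

static char src[4096] __attribute__((aligned(4096)));
static char dst[4096] __attribute__((aligned(4096)));

static void copy168_movsb(void *d, const void *s)
{
	long cnt = 168;

	asm volatile("rep movsb"
		     : "+c" (cnt), "+D" (d), "+S" (s)
		     : : "memory");
}

int main(void)
{
	const long iters = 100000000;

	/* sweep the destination offset, keep the source page-aligned */
	for (int off = 0; off < 64; off += 8) {
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (long i = 0; i < iters; i++)
			copy168_movsb(dst + off, src);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
		printf("dst offset %2d: %.2f ns/copy\n", off, ns / iters);
	}
	return 0;
}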

tunable usable as follows:

sysctl fs.magic_tunable=0 # rep movsb
sysctl fs.magic_tunable=1 # regular movs

hack:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6b22611e69cc..f5fd69b2dc5b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -915,6 +915,9 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 }
 
 #ifdef CONFIG_X86_64
+extern unsigned long magic_tunable;
+void *memcpy_orig(void *dest, const void *src, size_t n);
+
 /*
  * Help handler running on a per-cpu (IST or entry trampoline) stack
  * to switch to the normal thread stack if the interrupted code was in
@@ -923,8 +926,10 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
 	struct pt_regs *regs = (struct pt_regs *)current_top_of_stack() - 1;
-	if (regs != eregs)
-		*regs = *eregs;
+	if (!magic_tunable)
+		__memcpy(regs, eregs, sizeof(struct pt_regs));
+	else
+		memcpy_orig(regs, eregs, sizeof(struct pt_regs));
 	return regs;
 }
 
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 12a23fa7c44c..0f67378625b4 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -31,8 +31,6 @@
  * which the compiler could/should do much better anyway.
  */
 SYM_TYPED_FUNC_START(__memcpy)
-	ALTERNATIVE "jmp memcpy_orig", "", X86_FEATURE_FSRM
-
 	movq %rdi, %rax
 	movq %rdx, %rcx
 	rep movsb
@@ -44,7 +42,7 @@ SYM_FUNC_ALIAS_MEMFUNC(memcpy, __memcpy)
 SYM_PIC_ALIAS(memcpy)
 EXPORT_SYMBOL(memcpy)
 
-SYM_FUNC_START_LOCAL(memcpy_orig)
+SYM_TYPED_FUNC_START(memcpy_orig)
 	movq %rdi, %rax
 
 	cmpq $0x20, %rdx
diff --git a/fs/file_table.c b/fs/file_table.c
index cd4a3db4659a..de1ef700d144 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
 	{
 		.procname = "file-nr",
@@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
 		.extra1 = SYSCTL_LONG_ZERO,
 		.extra2 = SYSCTL_LONG_MAX,
 	},
+	{
+		.procname = "magic_tunable",
+		.data = &magic_tunable,
+		.maxlen = sizeof(magic_tunable),
+		.mode = 0644,
+		.proc_handler = proc_doulongvec_minmax,
+		.extra1 = SYSCTL_LONG_ZERO,
+		.extra2 = SYSCTL_LONG_MAX,
+	},
+
 	{
 		.procname = "nr_open",
 		.data = &sysctl_nr_open,