Message-ID: <mwwusvl7jllmck64xczeka42lglmsh7mlthuvmmqlmi5stp3na@raiwozh466wz>
Date: Thu, 27 Nov 2025 07:55:27 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: x86@...nel.org
Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, linux-kernel@...r.kernel.org,
torvalds@...ux-foundation.org, olichtne@...hat.com, atomasov@...hat.com, aokuliar@...hat.com
Subject: performance anomaly in rep movsq/movsb as seen on Sapphire Rapids
executing sync_regs()

Sapphire Rapids has both ERMS (of course) and FSRM.

sync_regs() runs into a corner case where both rep movsq and rep movsb
suffer a massive penalty when used to copy 168 bytes, a penalty which
clears itself when the data is copied with a bunch of movq instead.

I verified the issue is not present on AMD EPYC 9454; I don't know about
other Intel CPUs.
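
To make it concrete, the two rep-based variants being compared boil down
to something like the following (illustrative GNU C wrappers for a
168-byte copy, not the kernel's actual code paths -- the kernel goes
through __memcpy/__pi_memcpy):

static inline void copy168_movsq(void *dst, const void *src)
{
	long cnt = 168 / 8;	/* 21 quadwords */

	asm volatile("rep movsq"
		     : "+c" (cnt), "+D" (dst), "+S" (src)
		     : : "memory");
}

static inline void copy168_movsb(void *dst, const void *src)
{
	long cnt = 168;

	asm volatile("rep movsb"
		     : "+c" (cnt), "+D" (dst), "+S" (src)
		     : : "memory");
}
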
Details:

When benchmarking page faults (page_fault1 from will-it-scale),
sync_regs() is very high in the profile, performing a 168-byte copy with
rep movsq.

I figured movsq still sucks on the uarch, so I patched the kernel to use
movsb instead, but performance barely budged.

However, forcing the thing to do the copy with regular stores in
memcpy_orig (32 bytes per loop iteration + an 8-byte tail) unclogs it.
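
For 168 bytes that path works out to 5 iterations of 32 bytes plus a
single 8-byte tail. A rough C rendition of what the memcpy_orig route
ends up doing here (sketch only; the real thing is hand-written asm in
arch/x86/lib/memcpy_64.S):

static void copy168_unrolled(void *dst, const void *src)
{
	unsigned long *d = dst;
	const unsigned long *s = src;
	int i;

	/* 5 x 32 bytes = 160 bytes, four quadword loads/stores per iteration */
	for (i = 0; i < 5; i++) {
		d[0] = s[0];
		d[1] = s[1];
		d[2] = s[2];
		d[3] = s[3];
		d += 4;
		s += 4;
	}

	/* 8-byte tail to get to 168 */
	d[0] = s[0];
}
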
Check this out (ops/s):
rep movsb in ___pi_memcpy:
min:1293689 max:1293689 total:1293689
min:1293969 max:1293969 total:1293969
min:1293845 max:1293845 total:1293845
min:1293436 max:1293436 total:1293436

hand-rolled mov loop in memcpy_orig:
min:1498050 max:1498050 total:1498050
min:1499041 max:1499041 total:1499041
min:1498283 max:1498283 total:1498283
min:1499701 max:1499701 total:1499701
... or just shy of 16% faster.

I patched the kernel with a tunable, togglable at runtime, to select
which memcpy variant sync_regs() uses. Results reliably flip as I toggle
it.

perf top says:
rep movsb in ___pi_memcpy:
25.20% [kernel] [k] asm_exc_page_fault
14.60% [kernel] [k] __pi_memcpy
11.78% page_fault1_processes [.] testcase
4.71% [kernel] [k] _raw_spin_lock
2.36% [kernel] [k] __handle_mm_fault
2.00% [kernel] [k] clear_page_erms

hand-rolled mov loop in memcpy_orig:
27.99% [kernel] [k] asm_exc_page_fault
13.42% page_fault1_processes [.] testcase
5.46% [kernel] [k] _raw_spin_lock
2.72% [kernel] [k] __handle_mm_fault
2.48% [kernel] [k] clear_page_erms
[..]
0.59% [kernel] [k] memcpy_orig
0.04% [kernel] [k] __pi_memcpy

As you can see, the difference is staggering and this has to be a
deficiency at least in this uarch.

When it comes to sync_regs() specifically, I think it makes some sense
to instead recode it in asm and perhaps issue the movs "by hand", which
would work around the immediate problem and shave off a function call
per page fault.
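
Roughly what I have in mind, as an untested sketch: a fixed-count copy
the compiler can expand into plain moves. The caveat is the compiler may
turn the loop back into a memcpy call, in which case doing the movs in
the entry asm proper is the reliable variant:

asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
{
	struct pt_regs *regs = (struct pt_regs *)current_top_of_stack() - 1;

	if (regs != eregs) {
		unsigned long *dst = (unsigned long *)regs;
		const unsigned long *src = (const unsigned long *)eregs;
		unsigned int i;

		/* sizeof(struct pt_regs) == 168, i.e. 21 quadwords */
		for (i = 0; i < sizeof(struct pt_regs) / sizeof(long); i++)
			dst[i] = src[i];
	}
	return regs;
}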

However, per the profile results above, there is at least one case where
rep movsb-based memcpy can grossly underperform, and someone(tm) should
investigate what's going on there. Also note the kernel inlines plain
rep movsb for copy to/from user if FSRM is present, which is again
possibly susceptible to whatever the problem is.
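
For reference, with FSRM that inline boils down to something equivalent
to the following (illustrative sketch, ignoring STAC/CLAC and fault
handling; the real thing is copy_user_generic() in
arch/x86/include/asm/uaccess_64.h in recent kernels):

static inline unsigned long fsrm_user_copy(void *to, const void *from, unsigned long len)
{
	/* with X86_FEATURE_FSRM the ALTERNATIVE boils down to a bare rep movsb */
	asm volatile("rep movsb"
		     : "+c" (len), "+D" (to), "+S" (from)
		     : : "memory");
	return len;	/* bytes left uncopied, 0 on success */
}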

Maybe this is a matter of misalignment of the target or some other
bullshit; I have not tested it and I don't have the time to dig into it.
I would expect someone better clued in on this area to figure it out in
less time than I would need, hence I'm throwing this out there.
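
If someone wants to poke at the alignment angle, a userspace sweep along
these lines would be a cheap starting point (sketch only; it may well
fail to reproduce the effect outside of the page fault path, where the
destination is the task stack):

#include <stdio.h>
#include <time.h>

static char src[4096] __attribute__((aligned(4096)));
static char dst[4096] __attribute__((aligned(4096)));

static void copy168_movsb(void *d, const void *s)
{
	long cnt = 168;

	asm volatile("rep movsb"
		     : "+c" (cnt), "+D" (d), "+S" (s)
		     : : "memory");
}

int main(void)
{
	const long iters = 100000000;

	/* sweep the destination offset, keep the source page-aligned */
	for (int off = 0; off < 64; off += 8) {
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (long i = 0; i < iters; i++)
			copy168_movsb(dst + off, src);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
		printf("dst offset %2d: %.2f ns/copy\n", off, ns / iters);
	}
	return 0;
}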

tunable usable as follows:

sysctl fs.magic_tunable=0 # rep movsb
sysctl fs.magic_tunable=1 # regular movs

hack:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6b22611e69cc..f5fd69b2dc5b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -915,6 +915,9 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 }
 
 #ifdef CONFIG_X86_64
+extern unsigned long magic_tunable;
+void *memcpy_orig(void *dest, const void *src, size_t n);
+
 /*
  * Help handler running on a per-cpu (IST or entry trampoline) stack
  * to switch to the normal thread stack if the interrupted code was in
@@ -923,8 +926,10 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
 	struct pt_regs *regs = (struct pt_regs *)current_top_of_stack() - 1;
-	if (regs != eregs)
-		*regs = *eregs;
+	if (!magic_tunable)
+		__memcpy(regs, eregs, sizeof(struct pt_regs));
+	else
+		memcpy_orig(regs, eregs, sizeof(struct pt_regs));
 	return regs;
 }
 
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 12a23fa7c44c..0f67378625b4 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -31,8 +31,6 @@
  * which the compiler could/should do much better anyway.
  */
 SYM_TYPED_FUNC_START(__memcpy)
-	ALTERNATIVE "jmp memcpy_orig", "", X86_FEATURE_FSRM
-
 	movq %rdi, %rax
 	movq %rdx, %rcx
 	rep movsb
@@ -44,7 +42,7 @@ SYM_FUNC_ALIAS_MEMFUNC(memcpy, __memcpy)
 SYM_PIC_ALIAS(memcpy)
 EXPORT_SYMBOL(memcpy)
 
-SYM_FUNC_START_LOCAL(memcpy_orig)
+SYM_TYPED_FUNC_START(memcpy_orig)
 	movq %rdi, %rax
 
 	cmpq $0x20, %rdx
diff --git a/fs/file_table.c b/fs/file_table.c
index cd4a3db4659a..de1ef700d144 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
 	{
 		.procname = "file-nr",
@@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
 		.extra1 = SYSCTL_LONG_ZERO,
 		.extra2 = SYSCTL_LONG_MAX,
 	},
+	{
+		.procname = "magic_tunable",
+		.data = &magic_tunable,
+		.maxlen = sizeof(magic_tunable),
+		.mode = 0644,
+		.proc_handler = proc_doulongvec_minmax,
+		.extra1 = SYSCTL_LONG_ZERO,
+		.extra2 = SYSCTL_LONG_MAX,
+	},
+
 	{
 		.procname = "nr_open",
 		.data = &sysctl_nr_open,