Message-ID: <20250320190514.1961144-1-mjguzik@gmail.com>
Date: Thu, 20 Mar 2025 20:05:14 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: torvalds@...ux-foundation.org,
x86@...nel.org
Cc: hkrzesin@...hat.com,
tglx@...utronix.de,
mingo@...hat.com,
bp@...en8.de,
dave.hansen@...ux.intel.com,
hpa@...or.com,
olichtne@...hat.com,
atomasov@...hat.com,
aokuliar@...hat.com,
linux-kernel@...r.kernel.org,
Mateusz Guzik <mjguzik@...il.com>
Subject: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Sizes in the <8,64> range are copied 8 bytes at a time, with a jump out to a
1-byte-at-a-time loop to handle the tail.

This is trivially avoidable with overlapping stores, which are the standard
technique for this kind of routine (in fact memcpy() already uses them in a
more extensive form).
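For illustration, a minimal C sketch of the technique (not kernel code;
copy_small() is just a made-up name for this example): copy full 8-byte
words, then finish with a single 8-byte copy that ends exactly at the last
byte, overlapping whatever the word loop already wrote, instead of falling
back to a byte loop.

#include <string.h>

/* Caller guarantees 8 <= len <= 64. */
static void copy_small(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len > 8) {
		/* fixed-size memcpy() typically compiles to one 8-byte mov */
		memcpy(d, s, 8);
		d += 8;
		s += 8;
		len -= 8;
	}
	/* one overlapping 8-byte copy ending at the last byte of the buffer */
	memcpy(d + len - 8, s + len - 8, 8);
}

The final copy may rewrite up to 7 bytes already written by the loop, which
is harmless since the data is identical, and it avoids the branchy per-byte
tail.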
I traced calls in this range during a kernel build and found that 65% of
them had a tail to take care of.

The distribution of size & 7 in that range is as follows:
@:
[0, 1) 3282178 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 949627 |@@@@@@@@@@@@@@@ |
[2, 3) 1078842 |@@@@@@@@@@@@@@@@@ |
[3, 4) 998995 |@@@@@@@@@@@@@@@ |
[4, 5) 744515 |@@@@@@@@@@@ |
[5, 6) 925437 |@@@@@@@@@@@@@@ |
[6, 7) 663305 |@@@@@@@@@@ |
[7, ...) 786007 |@@@@@@@@@@@@ |
@stats[notail]: 3282178
@stats[tail]: 6146728
Signed-off-by: Mateusz Guzik <mjguzik@...il.com>
---
I added a call to a custom copy_user_probe() routine in copy_user_generic()
so that I could attach to it with bpftrace; the instrumentation is available
upon request. The one-liner is:
bpftrace -e 'kprobe:copy_user_probe /arg2 >= 8 && arg2 <= 64/ \
{ @ = lhist(arg2 & 7, 0, 7, 1); @stats[arg2 & 7 ? "tail" : "notail"] = count(); }'
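Roughly, the hook only needs to be a non-inlined, do-nothing function called
from copy_user_generic() with the same to/from/len arguments, for example
(a sketch, not the actual instrumentation):

#include <linux/compiler.h>

/*
 * Sketch of a kprobe attach point: copy_user_generic() calls this with its
 * own arguments before doing the copy, so the bpftrace one-liner above can
 * read the size from the third argument (arg2).
 */
noinline void copy_user_probe(void *to, const void *from, unsigned long len)
{
	/* intentionally empty; exists only so a kprobe can fire on entry */
}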
Anyhow, I don't have any means to benchmark this at the moment, as the only
hardware I have access to is in fact a little too modern (FSRM). But since
this is the standard technique, I don't think it will require much
convincing; if anything, the question is why this differs from memcpy(),
which *does* use overlapping stores.
Tested by forcing the kernel to use the routine; the kernel built just fine
with the change.

That said, absent my own benchmark results I'm not going to argue strongly
for the change, but I do think it is an OK tidy-up until someone(tm) puts
more effort here.
arch/x86/lib/copy_user_64.S | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index aa8c341b2441..2d5f42c521b5 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -65,10 +65,15 @@ SYM_FUNC_START(rep_movs_alternative)
 	je .Lexit
 	cmp $8,%ecx
 	jae .Lword
-	jmp .Lcopy_user_tail
+4:	movq -8(%rsi,%rcx),%rax
+5:	movq %rax,-8(%rdi,%rcx)
+	xorl %ecx,%ecx
+	RET
 
 	_ASM_EXTABLE_UA( 2b, .Lcopy_user_tail)
 	_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 4b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 5b, .Lcopy_user_tail)
 
 .Llarge:
 0:	ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
--
2.43.0