Message-ID: <20250320190514.1961144-1-mjguzik@gmail.com>
Date: Thu, 20 Mar 2025 20:05:14 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: torvalds@...ux-foundation.org,
	x86@...nel.org
Cc: hkrzesin@...hat.com,
	tglx@...utronix.de,
	mingo@...hat.com,
	bp@...en8.de,
	dave.hansen@...ux.intel.com,
	hpa@...or.com,
	olichtne@...hat.com,
	atomasov@...hat.com,
	aokuliar@...hat.com,
	linux-kernel@...r.kernel.org,
	Mateusz Guzik <mjguzik@...il.com>
Subject: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store

Sizes in the <8, 64> range are copied 8 bytes at a time, with a jump out to a
byte-at-a-time loop to handle the tail.

This is trivially avoidable with overlapping stores, the standard technique
for these kinds of routines (in fact memcpy() already uses it in a more
extensive form).
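
To make the idea concrete, here is a minimal C sketch of the technique (the
name copy_small is made up and this is illustration only; the actual change
is the assembly further down). The tail of 1-7 bytes is finished with a
single 8-byte copy ending exactly at the end of the buffer, overlapping bytes
the word loop already wrote. This is only safe because at least 8 bytes have
been copied by that point, which the size check guarantees.

#include <stddef.h>
#include <string.h>

/*
 * Sketch of an overlapping-store tail for lengths in <8, 64>.
 * Each 8-byte memcpy() compiles down to a single load/store pair; the
 * final one may re-store up to 7 bytes the loop already wrote, which is
 * harmless and avoids a byte-at-a-time loop.
 */
static void copy_small(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t done = 0;

	while (done + 8 <= len) {
		memcpy(d + done, s + done, 8);
		done += 8;
	}

	if (done != len)	/* 1-7 bytes left over */
		memcpy(d + len - 8, s + len - 8, 8);
}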

I traced calls in this range during a kernel build and found that 65% of
them had a tail to take care of.

Distribution of size & 7 in that range is as follows:
@:
[0, 1)           3282178 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2)            949627 |@@@@@@@@@@@@@@@                                     |
[2, 3)           1078842 |@@@@@@@@@@@@@@@@@                                   |
[3, 4)            998995 |@@@@@@@@@@@@@@@                                     |
[4, 5)            744515 |@@@@@@@@@@@                                         |
[5, 6)            925437 |@@@@@@@@@@@@@@                                      |
[6, 7)            663305 |@@@@@@@@@@                                          |
[7, ...)          786007 |@@@@@@@@@@@@                                        |

@stats[notail]: 3282178
@stats[tail]: 6146728

Signed-off-by: Mateusz Guzik <mjguzik@...il.com>
---

I added a call to a custom copy_user_probe() routine to copy_user_generic()
so that I could attach to it with bpftrace. Available upon request. The
one-liner is:

bpftrace -e 'kprobe:copy_user_probe /arg2 >= 8 && arg2 <= 64/ \
{ @ = lhist(arg2 & 7, 0, 7, 1); @stats[arg2 & 7 ? "tail" : "notail"] = count(); }'
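
Purely for illustration (the actual helper is the one available upon request
above, not shown here), such a hook can be as small as an empty, non-inlined
function called with the copy arguments:

#include <linux/compiler.h>	/* noinline */

/*
 * Illustrative sketch only, not the helper actually used for the numbers
 * above: an empty noinline function called from copy_user_generic() as
 * copy_user_probe(to, from, len), so a kprobe sees the length as arg2.
 */
noinline void copy_user_probe(void *to, const void *from, unsigned long len)
{
	/* intentionally empty; exists purely as a kprobe attach point */
}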

Anyhow, I don't have any means to benchmark this at the moment as the only
hw I have access to is in fact a little too modern (FSRM), but since this is
the standard technique I don't think it will require much convincing. If
anything, the question is why this differs from memcpy(), which *does* use
overlapping stores.

Tested by forcing the kernel to use the routine; the kernel built just fine
with the change.

That said, absent my own benchmark results I'm not going to strongly argue
for the change, but I do think it is an ok tidy-up until someone(tm) puts
more effort in here.

 arch/x86/lib/copy_user_64.S | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index aa8c341b2441..2d5f42c521b5 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -65,10 +65,15 @@ SYM_FUNC_START(rep_movs_alternative)
 	je .Lexit
 	cmp $8,%ecx
 	jae .Lword
-	jmp .Lcopy_user_tail
+4:	movq -8(%rsi,%rcx),%rax
+5:	movq %rax,-8(%rdi,%rcx)
+	xorl %ecx,%ecx
+	RET
 
 	_ASM_EXTABLE_UA( 2b, .Lcopy_user_tail)
 	_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 4b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 5b, .Lcopy_user_tail)
 
 .Llarge:
 0:	ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
-- 
2.43.0

