Message-ID: <20250320190514.1961144-1-mjguzik@gmail.com>
Date: Thu, 20 Mar 2025 20:05:14 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: torvalds@...ux-foundation.org,
x86@...nel.org
Cc: hkrzesin@...hat.com,
tglx@...utronix.de,
mingo@...hat.com,
bp@...en8.de,
dave.hansen@...ux.intel.com,
hpa@...or.com,
olichtne@...hat.com,
atomasov@...hat.com,
aokuliar@...hat.com,
linux-kernel@...r.kernel.org,
Mateusz Guzik <mjguzik@...il.com>
Subject: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
Sizes in the <8,64> range are copied 8 bytes at a time, with a jump out to a
1-byte-at-a-time loop to handle the tail.

This is trivially avoidable with overlapping stores, which are the standard
technique for this kind of routine (in fact memcpy() already uses them in a
more extensive form).
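For illustration, a minimal C sketch of the technique (not kernel code;
copy_small() is just a made-up name for this example): copy full 8-byte
words, then finish with a single 8-byte copy that ends exactly at the last
byte, overlapping whatever the word loop already wrote, instead of falling
back to a byte loop.

#include <string.h>

/* Caller guarantees 8 <= len <= 64. */
static void copy_small(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len > 8) {
		/* fixed-size memcpy() typically compiles to one 8-byte mov */
		memcpy(d, s, 8);
		d += 8;
		s += 8;
		len -= 8;
	}
	/* one overlapping 8-byte copy ending at the last byte of the buffer */
	memcpy(d + len - 8, s + len - 8, 8);
}

The final copy may rewrite up to 7 bytes already written by the loop, which
is harmless since the data is identical, and it avoids the branchy per-byte
tail.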
I traced calls in this range during a kernel build and found that 65% of
them had a tail to take care of.

The distribution of size & 7 in that range is as follows:
@:
[0, 1) 3282178 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 949627 |@@@@@@@@@@@@@@@ |
[2, 3) 1078842 |@@@@@@@@@@@@@@@@@ |
[3, 4) 998995 |@@@@@@@@@@@@@@@ |
[4, 5) 744515 |@@@@@@@@@@@ |
[5, 6) 925437 |@@@@@@@@@@@@@@ |
[6, 7) 663305 |@@@@@@@@@@ |
[7, ...) 786007 |@@@@@@@@@@@@ |
@stats[notail]: 3282178
@stats[tail]: 6146728
Signed-off-by: Mateusz Guzik <mjguzik@...il.com>
---
I added a call to a custom copy_user_probe() routine in copy_user_generic()
so that I could attach to it with bpftrace; the instrumentation is available
upon request. The one-liner is:
bpftrace -e 'kprobe:copy_user_probe /arg2 >= 8 && arg2 <= 64/ \
{ @ = lhist(arg2 & 7, 0, 7, 1); @stats[arg2 & 7 ? "tail" : "notail"] = count(); }'
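Roughly, the hook only needs to be a non-inlined, do-nothing function called
from copy_user_generic() with the same to/from/len arguments, for example
(a sketch, not the actual instrumentation):

#include <linux/compiler.h>

/*
 * Sketch of a kprobe attach point: copy_user_generic() calls this with its
 * own arguments before doing the copy, so the bpftrace one-liner above can
 * read the size from the third argument (arg2).
 */
noinline void copy_user_probe(void *to, const void *from, unsigned long len)
{
	/* intentionally empty; exists only so a kprobe can fire on entry */
}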
Anyhow, I don't have any means to benchmark this at the moment, as the only
hardware I have access to is in fact a little too modern (FSRM). But since
this is the standard technique, I don't think it will require much
convincing; if anything, the question is why this differs from memcpy(),
which *does* use overlapping stores.
Tested by forcing the kernel to use the routine; the kernel built just fine
with the change.

That said, absent my own benchmark results I'm not going to argue strongly
for the change, but I do think it is an OK tidy-up until someone(tm) puts
more effort here.
arch/x86/lib/copy_user_64.S | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index aa8c341b2441..2d5f42c521b5 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -65,10 +65,15 @@ SYM_FUNC_START(rep_movs_alternative)
 	je .Lexit
 	cmp $8,%ecx
 	jae .Lword
-	jmp .Lcopy_user_tail
+4:	movq -8(%rsi,%rcx),%rax
+5:	movq %rax,-8(%rdi,%rcx)
+	xorl %ecx,%ecx
+	RET
 
 	_ASM_EXTABLE_UA( 2b, .Lcopy_user_tail)
 	_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 4b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 5b, .Lcopy_user_tail)
 
 .Llarge:
 0:	ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
--
2.43.0