Message-ID: <CAHk-=wjxi0poUzCd666Kx5wCjgOwN5v=-zG8xSAL7Wj_ax8Zvw@mail.gmail.com>
Date: Thu, 20 Mar 2025 16:53:32 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: x86@...nel.org, hkrzesin@...hat.com, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com, olichtne@...hat.com,
atomasov@...hat.com, aokuliar@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an
overlapping store
On Thu, 20 Mar 2025 at 14:17, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> On Thu, 20 Mar 2025 at 12:33, Mateusz Guzik <mjguzik@...il.com> wrote:
> >
> > I have a recollection that handling the tail after rep movsq with an
> > overlapping store was suffering a penalty big enough to warrant a
> > "normal" copy instead, avoiding the just written to area.
>
> Ahh. Good point. The rep movsq might indeed end up having odd effects
> with subsequent aliasing memory operations.
>
> Consider myself convinced.
Actually, I think there's a solution for this.

Do not do the last 0-7 bytes as a word that overlaps with the tail of
the 'rep movs'.

Do the last 8-15 bytes *non-overlapping* (well, they overlap each
other, but not the 'rep movs').
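
For illustration, a minimal user-space C sketch of the same tail
scheme - the function name is made up, memcpy() stands in for an
unaligned word access, and there is none of the user-copy fault
handling:

    #include <string.h>
    #include <stddef.h>

    /* Copy 'len' bytes, len >= 8: a word loop plus an 8..15 byte
     * tail that never overlaps what the loop itself wrote. */
    static void copy_large(void *dst, const void *src, size_t len)
    {
            char *d = dst;
            const char *s = src;
            size_t words = (len >> 3) - 1;  /* one word less than len/8 */
            size_t rem = len & 7;           /* 0..7 */

            memcpy(d, s, words * 8);        /* the 'rep movsq' part */
            d += words * 8;
            s += words * 8;

            memcpy(d, s, 8);                /* first tail word */
            memcpy(d + rem, s + rem, 8);    /* final word: may overlap the
                                             * previous store, never the loop */
    }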
Something UNTESTED like the appended, in other words. The large case
then ends up without any conditionals, looking something like this:
        mov    %rcx,%rax                        # save the byte count
        shr    $0x3,%rcx                        # %rcx = number of whole words
        dec    %rcx                             # leave one word for the tail
        and    $0x7,%eax                        # %rax = byte count & 7
        rep movsq %ds:(%rsi),%es:(%rdi)         # copy 8*%rcx bytes
        mov    (%rsi),%rcx                      # first tail word
        mov    %rcx,(%rdi)
        mov    (%rsi,%rax,1),%rcx               # final word: may overlap the
        mov    %rcx,(%rdi,%rax,1)               #  store above, not the 'rep movs'
        xor    %ecx,%ecx                        # return 0: nothing left uncopied
        ret
with some added complexity - but not a lot - in the exception fixup cases.
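
As a concrete example of the arithmetic (my numbers, not from the
patch): entering with a count of 30 in %rcx gives %rax = 30, %rcx = 3
and then 2 after the dec, and the 'and' leaves %rax = 6. The 'rep
movsq' copies bytes 0-15, the first word store covers bytes 16-23, and
the final word covers bytes 22-29: the two tail stores overlap each
other in bytes 22-23, but neither touches the 'rep movs' destination.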
This is once again intentionally whitespace-damaged, because I don't
want people applying this mindlessly. Somebody needs to double-check
my logic, and verify that this also avoids the cost from the aliasing
with the rep movs.
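
To spell out the fixup arithmetic: a fault inside the 'rep movsq'
leaves %rcx holding the words not yet copied, so 8*%rcx + 8 + %rax
bytes remain (label 10). A fault on the first tail word means
8 + %rax bytes remain, with %rsi/%rdi already past the 'rep movs'
area (label 11). A fault on the final overlapping word means the
first tail word already succeeded, so the pointers advance by 8 and
only %rax bytes remain (label 12). All three then fall back to
.Lcopy_user_tail.
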
Linus
---
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -76,16 +76,38 @@ SYM_FUNC_START(rep_movs_alternative)
 .Llarge_movsq:
         movq %rcx,%rax
         shrq $3,%rcx
+        decq %rcx
         andl $7,%eax
+
+        /* 8*%rcx + 8 + %rax bytes: do the 8*%rcx */
 0:      rep movsq
-        movl %eax,%ecx
-        testl %ecx,%ecx
-        jne .Lcopy_user_tail
+
+        /* We now have 8 + %rax bytes left */
+1:      movq (%rsi),%rcx
+2:      movq %rcx,(%rdi)
+
+        /* %rax bytes left - do it as one overlapping word */
+3:      movq (%rsi,%rax),%rcx
+4:      movq %rcx,(%rdi,%rax)
+
+        xorl %ecx,%ecx
         RET
-1:      leaq (%rax,%rcx,8),%rcx
+10:     leaq 8(%rax,%rcx,8),%rcx
         jmp .Lcopy_user_tail
-        _ASM_EXTABLE_UA( 0b, 1b)
+11:     leaq 8(%rax),%rcx
+        jmp .Lcopy_user_tail
+
+12:     addq $8,%rsi
+        addq $8,%rdi
+        movl %eax,%ecx
+        jmp .Lcopy_user_tail
+
+        _ASM_EXTABLE_UA( 0b, 10b)
+        _ASM_EXTABLE_UA( 1b, 11b)
+        _ASM_EXTABLE_UA( 2b, 11b)
+        _ASM_EXTABLE_UA( 3b, 12b)
+        _ASM_EXTABLE_UA( 4b, 12b)
 SYM_FUNC_END(rep_movs_alternative)
 EXPORT_SYMBOL(rep_movs_alternative)
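
And a throwaway user-space check of the C sketch above (my code
again) - it compares lengths 8..64 against plain memcpy(), and since
the whole buffers are compared it also catches stray writes past
'len':

    #include <assert.h>
    #include <string.h>

    int main(void)
    {
            char src[128], dst[128], ref[128];

            for (int i = 0; i < 128; i++)
                    src[i] = (char)i;
            for (size_t len = 8; len <= 64; len++) {
                    memset(dst, 0, sizeof(dst));
                    memset(ref, 0, sizeof(ref));
                    memcpy(ref, src, len);
                    copy_large(dst, src, len);
                    assert(memcmp(dst, ref, sizeof(dst)) == 0);
            }
            return 0;
    }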