Message-ID: <CAHk-=wjxi0poUzCd666Kx5wCjgOwN5v=-zG8xSAL7Wj_ax8Zvw@mail.gmail.com>
Date: Thu, 20 Mar 2025 16:53:32 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: x86@...nel.org, hkrzesin@...hat.com, tglx@...utronix.de, mingo@...hat.com, 
	bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com, olichtne@...hat.com, 
	atomasov@...hat.com, aokuliar@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an
 overlapping store

On Thu, 20 Mar 2025 at 14:17, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> On Thu, 20 Mar 2025 at 12:33, Mateusz Guzik <mjguzik@...il.com> wrote:
> >
> > I have a recollection that handling the tail after rep movsq with an
> > overlapping store was suffering a penalty big enough to warrant a
> > "normal" copy instead, avoiding the just written to area.
>
> Ahh. Good point. The rep movsq might indeed end up having odd effects
> with subsequent aliasing memory operations.
>
> Consider myself convinced.

Actually, I think there's a solution for this.

Do not do the last 0-7 bytes as a word that overlaps with the tail of
the 'rep movs'.

Do the last 8-15 bytes *non-overlapping* (well, they overlap each
other, but not the 'rep movs').

Something UNTESTED like the appended, in other words. The large case
then ends up without any conditionals, looking something like this:

        mov    %rcx,%rax                  # %rax = byte count
        shr    $0x3,%rcx                  # %rcx = word count ...
        dec    %rcx                       # ... minus the one left for the tail
        and    $0x7,%eax                  # %eax = byte count & 7
        rep movsq %ds:(%rsi),%es:(%rdi)   # bulk copy, a word at a time
        mov    (%rsi),%rcx                # copy the next word ...
        mov    %rcx,(%rdi)
        mov    (%rsi,%rax,1),%rcx         # ... and the final word, overlapping
        mov    %rcx,(%rdi,%rax,1)         #     the word above, not the 'rep movsq' area
        xor    %ecx,%ecx                  # nothing left uncopied
        ret

with some added complexity - but not a lot - in the exception fixup cases.
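
In plain C - with copy_large_tail() being a made-up name purely for
illustration, and ignoring the whole user-copy / fault-handling side
of it - the tail trick is roughly this:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch only, not the kernel code: copy 'len' bytes the way the new
 * .Llarge_movsq path does.  Assumes len >= 8, which the large path
 * trivially satisfies.
 */
static void copy_large_tail(void *dst, const void *src, size_t len)
{
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t words = (len >> 3) - 1;  /* the "decq %rcx" */
        size_t rem = len & 7;           /* the "andl $7,%eax" */
        uint64_t tmp;

        /* the 'rep movsq' part: 'words' full words */
        while (words--) {
                memcpy(&tmp, s, 8); memcpy(d, &tmp, 8);
                s += 8; d += 8;
        }

        /* 8 + rem bytes left: one word at offset 0 ... */
        memcpy(&tmp, s, 8); memcpy(d, &tmp, 8);

        /*
         * ... and one word at offset 'rem', which may overlap the word
         * above but never the area the loop just wrote.
         */
        memcpy(&tmp, s + rem, 8); memcpy(d + rem, &tmp, 8);
}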

This is once again intentionally whitespace-damaged, because I don't
want people applying this mindlessly. Somebody needs to double-check
my logic, and verify that this also avoids the cost from the aliasing
with the rep movs.
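
For whoever does check it, the byte counts the new fixup targets are
meant to hand back to .Lcopy_user_tail work out as:

    fault in 0 (the rep movsq):  %rcx words still to go, plus the tail
                                   -> 8*%rcx + 8 + %rax    (label 10)
    fault in 1 or 2:             the whole tail is left
                                   -> 8 + %rax             (label 11)
    fault in 3 or 4:             the first tail word already made it,
                                 so step %rsi/%rdi past it
                                   -> %rax                 (label 12)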

                   Linus

---
  --- a/arch/x86/lib/copy_user_64.S
  +++ b/arch/x86/lib/copy_user_64.S
  @@ -76,16 +76,38 @@ SYM_FUNC_START(rep_movs_alternative)
   .Llarge_movsq:
        movq %rcx,%rax
        shrq $3,%rcx
  +     decq %rcx
        andl $7,%eax
  +
  +     /* 8*%rcx + 8 + %rax bytes: do the 8*%rcx */
   0:   rep movsq
  -     movl %eax,%ecx
  -     testl %ecx,%ecx
  -     jne .Lcopy_user_tail
  +
  +     /* We now have 8 + %rax bytes left */
  +1:   movq (%rsi),%rcx
  +2:   movq %rcx,(%rdi)
  +
  +     /* %rax bytes left - do it as one overlapping word */
  +3:   movq (%rsi,%rax),%rcx
  +4:   movq %rcx,(%rdi,%rax)
  +
  +     xorl %ecx,%ecx
        RET

  -1:   leaq (%rax,%rcx,8),%rcx
  +10:  leaq 8(%rax,%rcx,8),%rcx
        jmp .Lcopy_user_tail

  -     _ASM_EXTABLE_UA( 0b, 1b)
  +11:  leaq 8(%rax),%rcx
  +     jmp .Lcopy_user_tail
  +
  +12:  addq $8,%rsi
  +     addq $8,%rdi
  +     movl %eax,%ecx
  +     jmp .Lcopy_user_tail
  +
  +     _ASM_EXTABLE_UA( 0b, 10b)
  +     _ASM_EXTABLE_UA( 1b, 11b)
  +     _ASM_EXTABLE_UA( 2b, 11b)
  +     _ASM_EXTABLE_UA( 3b, 12b)
  +     _ASM_EXTABLE_UA( 4b, 12b)
   SYM_FUNC_END(rep_movs_alternative)
   EXPORT_SYMBOL(rep_movs_alternative)
