Message-ID: <CAKwvOd=C=yhC5-fBVxD18GSWOibPpSJ1G4bfM=X0bw+LpTsfgg@mail.gmail.com>
Date: Wed, 28 Sep 2022 12:06:43 -0700
From: Nick Desaulniers <ndesaulniers@...gle.com>
To: Rasmus Villemoes <linux@...musvillemoes.dk>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H . Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>,
Kees Cook <keescook@...omium.org>,
linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
llvm@...ts.linux.dev, Andy Lutomirski <luto@...nel.org>
Subject: Re: [PATCH v3] x86, mem: move memmove to out of line assembler
On Wed, Sep 28, 2022 at 12:24 AM Rasmus Villemoes
<linux@...musvillemoes.dk> wrote:
>
> On 27/09/2022 23.02, Nick Desaulniers wrote:
>
> > + /* Decide forward/backward copy mode */
> > + cmpl dest, src
> > + jb .Lbackwards_header
>
> I know you're mostly just moving existing code, but for my own education
> I'd like to understand this.
>
> > + /*
> > + * movs instruction have many startup latency
> > + * so we handle small size by general register.
> > + */
> > + cmpl $680, n
> > + jb .Ltoo_small_forwards
>
> OK, this I get, there's some overhead, and hence we need _some_ cutoff
> value; 680 is probably chosen by some trial-and-error, but the exact
> value likely doesn't matter too much.
__memmove in arch/x86/lib/memmove_64.S uses the same value.
But I assume this is precisely why FSRM was created.
https://www.phoronix.com/news/Intel-5.6-FSRM-Memmove
commit f444a5ff95dc ("x86/cpufeatures: Add support for fast short REP; MOVSB")
>
> > + /*
> > + * movs instruction is only good for aligned case.
> > + */
> > + movl src, tmp0
> > + xorl dest, tmp0
> > + andl $0xff, tmp0
> > + jz .Lforward_movs
>
> But this part I don't understand at all. This checks that the src and
> dest have the same %256 value, which is a rather odd thing, and very
> unlikely to ever be hit in practice. I could understand if it checked
> that they were both 4 or 8 or 16-byte aligned (i.e., (src|dest)&FOO)),
> or if it checked that they had the same offset within a cacheline [say
> (src^dest)&0x3f].
>
> Any idea where that comes from? Or am I just incapable of reading x86 asm?
So I think the above is roughly:
if (((src ^ dest) & 0xff) == 0)
goto .Lforward_movs;
So if src and dest don't have the same bottom address byte, don't use movs?
>
> > +.Ltoo_small_forwards:
> > + subl $0x10, n
> > +
> > + /*
> > + * We gobble 16 bytes forward in each loop.
> > + */
> > +.L16_byteswap_forwards_loop:
> > + subl $0x10, n
> > + movl 0*4(src), tmp0
> > + movl 1*4(src), tmp1
> > + movl tmp0, 0*4(dest)
> > + movl tmp1, 1*4(dest)
> > + movl 2*4(src), tmp0
> > + movl 3*4(src), tmp1
> > + movl tmp0, 2*4(dest)
> > + movl tmp1, 3*4(dest)
> > + leal 0x10(src), src
> > + leal 0x10(dest), dest
> > + jae .L16_byteswap_forwards_loop
> > + addl $0x10, n
> > + jmp .L16_byteswap
> > +
> > + /*
> > + * Handle data forward by movs.
> > + */
> > +.p2align 4
> > +.Lforward_movs:
> > + movl -4(src, n), tmp0
> > + leal -4(dest, n), tmp1
> > + shrl $2, n
> > + rep movsl
> > + movl tmp0, (tmp1)
> > + jmp .Ldone
>
> So in the original code, %1 was forced to be %esi and %2 was forced to
> be %edi and they were initialized by src and dest. But here I fail to
> see how those registers have been properly set up before the rep movs;
> your names for those are tmp0 and tmp2. You have just loaded the last
> word of the source to %edi, and AFAICT %esi aka tmp2 is entirely
> uninitialized at this point (the only use is in L16_byteswap).
>
> I must be missing something. Please enlighten me.
No, you're right. It looks like rep movsl needs src in %esi and dest in
%edi, so I can't reuse the input registers from -mregparm=3; a pair of
movs is needed to set them up first. I'll send a v4.
Probably should also write a test for memmove where n exceeds the magic constant 680.
--
Thanks,
~Nick Desaulniers