Message-ID: <CAKwvOd=C=yhC5-fBVxD18GSWOibPpSJ1G4bfM=X0bw+LpTsfgg@mail.gmail.com>
Date: Wed, 28 Sep 2022 12:06:43 -0700
From: Nick Desaulniers <ndesaulniers@...gle.com>
To: Rasmus Villemoes <linux@...musvillemoes.dk>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H . Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>,
Kees Cook <keescook@...omium.org>,
linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
llvm@...ts.linux.dev, Andy Lutomirski <luto@...nel.org>
Subject: Re: [PATCH v3] x86, mem: move memmove to out of line assembler
On Wed, Sep 28, 2022 at 12:24 AM Rasmus Villemoes
<linux@...musvillemoes.dk> wrote:
>
> On 27/09/2022 23.02, Nick Desaulniers wrote:
>
> > + /* Decide forward/backward copy mode */
> > + cmpl dest, src
> > + jb .Lbackwards_header
>
> I know you're mostly just moving existing code, but for my own education
> I'd like to understand this.
>
> > + /*
> > + * movs instruction have many startup latency
> > + * so we handle small size by general register.
> > + */
> > + cmpl $680, n
> > + jb .Ltoo_small_forwards
>
> OK, this I get, there's some overhead, and hence we need _some_ cutoff
> value; 680 is probably chosen by some trial-and-error, but the exact
> value likely doesn't matter too much.
__memmove in arch/x86/lib/memmove_64.S uses the same value.
But I assume this is precisely why FSRM was created.
https://www.phoronix.com/news/Intel-5.6-FSRM-Memmove
commit f444a5ff95dc ("x86/cpufeatures: Add support for fast short REP; MOVSB")
>
> > + /*
> > + * movs instruction is only good for aligned case.
> > + */
> > + movl src, tmp0
> > + xorl dest, tmp0
> > + andl $0xff, tmp0
> > + jz .Lforward_movs
>
> But this part I don't understand at all. This checks that the src and
> dest have the same %256 value, which is a rather odd thing, and very
> unlikely to ever be hit in practice. I could understand if it checked
> that they were both 4 or 8 or 16-byte aligned (i.e., (src|dest)&FOO)),
> or if it checked that they had the same offset within a cacheline [say
> (src^dest)&0x3f].
>
> Any idea where that comes from? Or am I just incapable of reading x86 asm?
So I think the above is roughly:
if (((src ^ dest) & 0xff) == 0)
goto .Lforward_movs;
So if src and dest don't have the same bottom address byte, don't use movs?
>
> > +.Ltoo_small_forwards:
> > + subl $0x10, n
> > +
> > + /*
> > + * We gobble 16 bytes forward in each loop.
> > + */
> > +.L16_byteswap_forwards_loop:
> > + subl $0x10, n
> > + movl 0*4(src), tmp0
> > + movl 1*4(src), tmp1
> > + movl tmp0, 0*4(dest)
> > + movl tmp1, 1*4(dest)
> > + movl 2*4(src), tmp0
> > + movl 3*4(src), tmp1
> > + movl tmp0, 2*4(dest)
> > + movl tmp1, 3*4(dest)
> > + leal 0x10(src), src
> > + leal 0x10(dest), dest
> > + jae .L16_byteswap_forwards_loop
> > + addl $0x10, n
> > + jmp .L16_byteswap
> > +
> > + /*
> > + * Handle data forward by movs.
> > + */
> > +.p2align 4
> > +.Lforward_movs:
> > + movl -4(src, n), tmp0
> > + leal -4(dest, n), tmp1
> > + shrl $2, n
> > + rep movsl
> > + movl tmp0, (tmp1)
> > + jmp .Ldone
>
> So in the original code, %1 was forced to be %esi and %2 was forced to
> be %edi and they were initialized by src and dest. But here I fail to
> see how those registers have been properly set up before the rep movs;
> your names for those are tmp0 and tmp2. You have just loaded the last
> word of the source to %edi, and AFAICT %esi aka tmp2 is entirely
> uninitialized at this point (the only use is in L16_byteswap).
>
> I must be missing something. Please enlighten me.
No, you're right. It looks like rep movsl needs src in %esi and dest in
%edi, so I can't reuse the input registers from -mregparm=3; a pair of
movs is needed to set them up first. I'll send a v4.
Probably should also write a test for memmove where n exceeds the magic constant 680.
--
Thanks,
~Nick Desaulniers