Message-ID: <af19820cd24544cd8833d6db6d38154b@AcuMS.aculab.com>
Date: Mon, 12 Jul 2021 08:15:41 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Matteo Croce' <mcroce@...ux.microsoft.com>,
Andrew Morton <akpm@...ux-foundation.org>
CC: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Nick Kossifidis <mick@....forth.gr>,
Guo Ren <guoren@...nel.org>,
Christoph Hellwig <hch@...radead.org>,
Palmer Dabbelt <palmer@...belt.com>,
"Emil Renner Berthing" <kernel@...il.dk>,
Drew Fustini <drew@...gleboard.org>,
linux-arch <linux-arch@...r.kernel.org>,
Nick Desaulniers <ndesaulniers@...gle.com>,
linux-riscv <linux-riscv@...ts.infradead.org>
Subject: RE: [PATCH v2 0/3] lib/string: optimized mem* functions
From: Matteo Croce
> Sent: 11 July 2021 00:08
>
> On Sat, Jul 10, 2021 at 11:31 PM Andrew Morton
> <akpm@...ux-foundation.org> wrote:
> >
> > On Fri, 2 Jul 2021 14:31:50 +0200 Matteo Croce <mcroce@...ux.microsoft.com> wrote:
> >
> > > From: Matteo Croce <mcroce@...rosoft.com>
> > >
> > > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > > the widest size possible, but without doing unaligned accesses.
> > >
> > > This was originally posted as C string functions for RISC-V[1], but as
> > > there was no specific RISC-V code, it was proposed for the generic
> > > lib/string.c implementation.
> > >
> > > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> > >
> > > These are the performances of memcpy() and memset() of a RISC-V machine
> > > on a 32 mbyte buffer:
> > >
> > > memcpy:
> > > original aligned: 75 Mb/s
> > > original unaligned: 75 Mb/s
> > > new aligned: 114 Mb/s
> > > new unaligned: 107 Mb/s
> > >
> > > memset:
> > > original aligned: 140 Mb/s
> > > original unaligned: 140 Mb/s
> > > new aligned: 241 Mb/s
> > > new unaligned: 241 Mb/s
> >
> > Did you record the x86_64 performance?
> >
> >
> > Which other architectures are affected by this change?
>
> x86_64 won't use these functions because it defines __HAVE_ARCH_MEMCPY
> and has optimized implementations in arch/x86/lib.
> Anyway, I was curious and I tested them on x86_64 too, there was zero
> gain over the generic ones.
x86 performance (and the attainable performance) does depend on the
CPU micro-architecture.
Any recent 'desktop' Intel CPU will almost certainly manage to
re-order the execution of almost any copy loop and attain 1 write per clock.
(Even the trivial 'while (count--) *dest++ = *src++;' loop.)
The same isn't true of the Atom-based CPUs that may be found in small
servers. These are no slouches (e.g. 4 cores at 2.4GHz) but they have
only limited out-of-order execution and so are much more sensitive to
instruction ordering.
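
[Editor's note: a minimal sketch of the word-at-a-time copy technique the
patch series describes, to make the discussion concrete. This is not the
patch's actual code; `word_memcpy` is a hypothetical name, and for brevity
this sketch only takes the wide path when source and destination happen to
share the same alignment, falling back to byte copies otherwise.]

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: copy byte-by-byte until the destination is long-aligned,
 * then copy one unsigned long at a time (the widest access that is
 * still aligned), then finish the tail in bytes. */
static void *word_memcpy(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Head: bring the destination up to sizeof(long) alignment. */
	while (count && ((uintptr_t)d % sizeof(long)) != 0) {
		*d++ = *s++;
		count--;
	}

	/* Bulk: word-sized loads/stores, but only if the source is now
	 * aligned too, so no unaligned access is ever issued. */
	if (((uintptr_t)s % sizeof(long)) == 0) {
		while (count >= sizeof(long)) {
			*(unsigned long *)d = *(const unsigned long *)s;
			d += sizeof(long);
			s += sizeof(long);
			count -= sizeof(long);
		}
	}

	/* Tail: remaining bytes. */
	while (count--)
		*d++ = *s++;

	return dest;
}
```

On an in-order core like the Atoms mentioned above, the word-sized loop
issues an eighth of the loads and stores of the trivial byte loop, which
is roughly where the reported ~1.5x memcpy gain comes from.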
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)