Message-ID: <CAOGi=dMtFNK=Oec2aYSXT9CFfx=GtKEKi9MQ9+iv_SK044iodQ@mail.gmail.com>
Date: Mon, 22 Oct 2012 17:23:16 +0800
From: Ling Ma <ling.ma.program@...il.com>
To: mingo@...e.hu
Cc: hpa@...or.com, tglx@...utronix.de, linux-kernel@...r.kernel.org,
iant@...gle.com, Ma Ling <ling.ma.program@...il.com>
Subject: Re: [PATCH RFC V2] [x86] Optimize small size memcpy by avoiding long
latency from decode stage
Attached are a memcpy micro benchmark, CPU info, and comparison results
between rep movsq/b and memcpy on Atom and IVB.
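
The attachment itself is not reproduced here. As a rough illustration of the
kind of comparison it performs, a minimal user-space sketch might look like
the following (buffer sizes, iteration counts, and the use of clock_gettime
are assumptions, and glibc's memcpy stands in for the kernel's unrolled
variant; this is not the attached memcpy-kernel.c):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

/* Copy len bytes with the hardware "fast string" operation. */
static void *rep_movsb(void *dst, const void *src, size_t len)
{
	void *ret = dst;

	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : : "memory");
	return ret;
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
	static char src[4096], dst[4096];
	const size_t sizes[] = { 8, 16, 32, 64, 128, 256 };
	const int iters = 1000000;
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		size_t n = sizes[i];
		uint64_t t0, t1, t2;
		int j;

		t0 = now_ns();
		for (j = 0; j < iters; j++) {
			rep_movsb(dst, src, n);
			asm volatile("" : : : "memory");	/* keep the copy */
		}
		t1 = now_ns();
		for (j = 0; j < iters; j++) {
			memcpy(dst, src, n);
			asm volatile("" : : : "memory");	/* keep the copy */
		}
		t2 = now_ns();

		printf("size %4zu: rep movsb %6.2f ns/copy, memcpy %6.2f ns/copy\n",
		       n, (double)(t1 - t0) / iters, (double)(t2 - t1) / iters);
	}
	return 0;
}
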
Thanks
Ling
2012/10/23, ling.ma.program@...il.com <ling.ma.program@...il.com>:
> From: Ma Ling <ling.ma.program@...il.com>
>
> CISC code has higher instruction density, saving memory and
> improving i-cache hit rate. However, decoding becomes a challenge:
> only one multiple-uop (2~3 uops) instruction can be decoded per cycle,
> and instructions containing more than 4 uops (rep movsq/b) have to be
> handled by the MS-ROM. That process takes a long time and eats up the
> advantage for small copy sizes.
>
>
> To avoid this disadvantage, we use general instruction code for
> small-size copies. The result shows a 1~2x improvement
> on Core2, Nehalem, Sandy Bridge, Ivy Bridge, Atom, and Bulldozer as well.
>
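
[For readers following along in C rather than assembly, the effect of the
check being added is roughly the sketch below. The function names are
hypothetical; the real kernel change is the alternatives-patched assembly
in the diff that follows.]

#include <stddef.h>

/* Stand-in for the kernel's unrolled mov-based copy (illustrative only). */
static void *copy_unrolled(void *dst, const void *src, size_t len)
{
	char *d = dst;
	const char *s = src;

	while (len--)
		*d++ = *s++;
	return dst;
}

/* "Fast string" variant: hardware-driven copy via rep movsb. */
static void *copy_rep_movsb(void *dst, const void *src, size_t len)
{
	void *ret = dst;

	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : : "memory");
	return ret;
}

/* The dispatch the patch introduces: copies of 256 bytes or less fall
 * back to the general-instruction routine and avoid the MS-ROM decode. */
void *memcpy_sketch(void *dst, const void *src, size_t len)
{
	if (len <= 256)
		return copy_unrolled(dst, src, len);
	return copy_rep_movsb(dst, src, len);
}
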
> Signed-off-by: Ma Ling <ling.ma.program@...il.com>
> ---
> In this version we decrease the warm-up distance from 512 to 256 bytes
> for upcoming CPUs, which manages to reduce latency, although the long
> decode time is still incurred.
>
> Thanks
> Ling
>
> arch/x86/lib/memcpy_64.S | 14 +++++++++++++-
> 1 files changed, 13 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
> index 1c273be..6a24c8c 100644
> --- a/arch/x86/lib/memcpy_64.S
> +++ b/arch/x86/lib/memcpy_64.S
> @@ -5,7 +5,6 @@
> #include <asm/cpufeature.h>
> #include <asm/dwarf2.h>
> #include <asm/alternative-asm.h>
> -
> /*
> * memcpy - Copy a memory block.
> *
> @@ -19,6 +18,15 @@
> */
>
> /*
> + * memcpy_c() and memcpy_c_e() use rep movsq/movsb respectively.
> + * These instructions have to get their micro-ops from the Microcode
> + * Sequencer ROM, and that decode process takes a long time. To avoid
> + * it, we choose the loop-unrolled routine for small sizes.
> + * The warm-up distance could be varied.
> + */
> +
> +
> +/*
> * memcpy_c() - fast string ops (REP MOVSQ) based variant.
> *
> * This gets patched over the unrolled variant (below) via the
> @@ -26,6 +34,8 @@
> */
> .section .altinstr_replacement, "ax", @progbits
> .Lmemcpy_c:
> + cmpq $256, %rdx
> + jbe memcpy
> movq %rdi, %rax
> movq %rdx, %rcx
> shrq $3, %rcx
> @@ -46,6 +56,8 @@
> */
> .section .altinstr_replacement, "ax", @progbits
> .Lmemcpy_c_e:
> + cmpq $256, %rdx
> + jbe memcpy
> movq %rdi, %rax
> movq %rdx, %rcx
> rep movsb
> --
> 1.6.5.2
>
>
View attachment "atom-cpu-info" of type "text/plain" (1464 bytes)
View attachment "atom-memcpy-result" of type "text/plain" (2252 bytes)
View attachment "ivb-cpu-info" of type "text/plain" (3788 bytes)
View attachment "ivb-memcpy-result" of type "text/plain" (2163 bytes)
View attachment "memcpy-kernel.c" of type "text/x-csrc" (6959 bytes)