Message-ID: <CAOGi=dMtFNK=Oec2aYSXT9CFfx=GtKEKi9MQ9+iv_SK044iodQ@mail.gmail.com>
Date: Mon, 22 Oct 2012 17:23:16 +0800
From: Ling Ma <ling.ma.program@...il.com>
To: mingo@...e.hu
Cc: hpa@...or.com, tglx@...utronix.de, linux-kernel@...r.kernel.org,
iant@...gle.com, Ma Ling <ling.ma.program@...il.com>
Subject: Re: [PATCH RFC V2] [x86] Optimize small size memcpy by avoiding long
latency from decode stage
Attached are a memcpy micro benchmark, CPU info, and comparison results
between rep movsq/b and memcpy on Atom and IVB.
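
The attachment itself is not reproduced here. As a rough illustration of the
kind of comparison it performs, a minimal user-space sketch might look like
the following (buffer sizes, iteration counts, and the use of clock_gettime
are assumptions, and glibc's memcpy stands in for the kernel's unrolled
variant; this is not the attached memcpy-kernel.c):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

/* Copy len bytes with the hardware "fast string" operation. */
static void *rep_movsb(void *dst, const void *src, size_t len)
{
	void *ret = dst;

	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : : "memory");
	return ret;
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
	static char src[4096], dst[4096];
	const size_t sizes[] = { 8, 16, 32, 64, 128, 256 };
	const int iters = 1000000;
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		size_t n = sizes[i];
		uint64_t t0, t1, t2;
		int j;

		t0 = now_ns();
		for (j = 0; j < iters; j++) {
			rep_movsb(dst, src, n);
			asm volatile("" : : : "memory");	/* keep the copy */
		}
		t1 = now_ns();
		for (j = 0; j < iters; j++) {
			memcpy(dst, src, n);
			asm volatile("" : : : "memory");	/* keep the copy */
		}
		t2 = now_ns();

		printf("size %4zu: rep movsb %6.2f ns/copy, memcpy %6.2f ns/copy\n",
		       n, (double)(t1 - t0) / iters, (double)(t2 - t1) / iters);
	}
	return 0;
}
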
Thanks
Ling
2012/10/23, ling.ma.program@...il.com <ling.ma.program@...il.com>:
> From: Ma Ling <ling.ma.program@...il.com>
>
> CISC code has higher instruction density, saving memory and
> improving i-cache hit rate. However, decoding becomes a challenge:
> only one multiple-uop (2~3 uops) instruction can be decoded per cycle,
> and instructions containing more than 4 uops (rep movsq/b) have to be
> handled by the MS-ROM. That process takes a long time and eats up the
> advantage for small copy sizes.
>
>
> To avoid this disadvantage, we use general instruction code for
> small-size copies. The result shows a 1~2x improvement
> on Core2, Nehalem, Sandy Bridge, Ivy Bridge, Atom, and Bulldozer as well.
>
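
[For readers following along in C rather than assembly, the effect of the
check being added is roughly the sketch below. The function names are
hypothetical; the real kernel change is the alternatives-patched assembly
in the diff that follows.]

#include <stddef.h>

/* Stand-in for the kernel's unrolled mov-based copy (illustrative only). */
static void *copy_unrolled(void *dst, const void *src, size_t len)
{
	char *d = dst;
	const char *s = src;

	while (len--)
		*d++ = *s++;
	return dst;
}

/* "Fast string" variant: hardware-driven copy via rep movsb. */
static void *copy_rep_movsb(void *dst, const void *src, size_t len)
{
	void *ret = dst;

	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : : "memory");
	return ret;
}

/* The dispatch the patch introduces: copies of 256 bytes or less fall
 * back to the general-instruction routine and avoid the MS-ROM decode. */
void *memcpy_sketch(void *dst, const void *src, size_t len)
{
	if (len <= 256)
		return copy_unrolled(dst, src, len);
	return copy_rep_movsb(dst, src, len);
}
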
> Signed-off-by: Ma Ling <ling.ma.program@...il.com>
> ---
> In this version we decrease the warm-up distance from 512 to 256 bytes
> for upcoming CPUs, which manages to reduce latency, although the long
> decode time is still incurred.
>
> Thanks
> Ling
>
> arch/x86/lib/memcpy_64.S | 14 +++++++++++++-
> 1 files changed, 13 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
> index 1c273be..6a24c8c 100644
> --- a/arch/x86/lib/memcpy_64.S
> +++ b/arch/x86/lib/memcpy_64.S
> @@ -5,7 +5,6 @@
> #include <asm/cpufeature.h>
> #include <asm/dwarf2.h>
> #include <asm/alternative-asm.h>
> -
> /*
> * memcpy - Copy a memory block.
> *
> @@ -19,6 +18,15 @@
> */
>
> /*
> + * memcpy_c() and memcpy_c_e() use rep movsq/movsb respectively.
> + * These instructions have to get their micro-ops from the Microcode
> + * Sequencer ROM, and that decode process takes a long time. To avoid
> + * it, we choose the loop-unrolled routine for small sizes.
> + * The warm-up distance could be varied.
> + */
> +
> +
> +/*
> * memcpy_c() - fast string ops (REP MOVSQ) based variant.
> *
> * This gets patched over the unrolled variant (below) via the
> @@ -26,6 +34,8 @@
> */
> .section .altinstr_replacement, "ax", @progbits
> .Lmemcpy_c:
> + cmpq $256, %rdx
> + jbe memcpy
> movq %rdi, %rax
> movq %rdx, %rcx
> shrq $3, %rcx
> @@ -46,6 +56,8 @@
> */
> .section .altinstr_replacement, "ax", @progbits
> .Lmemcpy_c_e:
> + cmpq $256, %rdx
> + jbe memcpy
> movq %rdi, %rax
> movq %rdx, %rcx
> rep movsb
> --
> 1.6.5.2
>
>
View attachment "atom-cpu-info" of type "text/plain" (1464 bytes)
View attachment "atom-memcpy-result" of type "text/plain" (2252 bytes)
View attachment "ivb-cpu-info" of type "text/plain" (3788 bytes)
View attachment "ivb-memcpy-result" of type "text/plain" (2163 bytes)
View attachment "memcpy-kernel.c" of type "text/x-csrc" (6959 bytes)