Date:	Mon, 9 Nov 2009 15:24:03 +0800
From:	"Ma, Ling" <ling.ma@...el.com>
To:	"H. Peter Anvin" <hpa@...or.com>
CC:	"mingo@...e.hu" <mingo@...e.hu>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by
 fast string.

Hi All

Today we ran our benchmarks on Core2 and Sandy Bridge:

1. Results on Core2
Speedup on Core2
   Len        Alignment              Speedup
  1024,       0/ 0:                 0.95x 
  2048,       0/ 0:                 1.03x 
  3072,       0/ 0:                 1.02x 
  4096,       0/ 0:                 1.09x 
  5120,       0/ 0:                 1.13x 
  6144,       0/ 0:                 1.13x 
  7168,       0/ 0:                 1.14x 
  8192,       0/ 0:                 1.13x 
  9216,       0/ 0:                 1.14x 
  10240,      0/ 0:                 0.99x 
  11264,      0/ 0:                 1.14x 
  12288,      0/ 0:                 1.14x 
  13312,      0/ 0:                 1.10x 
  14336,      0/ 0:                 1.10x 
  15360,      0/ 0:                 1.13x
The application was run through perf; the measured loop is:
for (i = 1024; i < 1024 * 16; i += 64)
	do_memcpy(0, 0, i);
Each binary was run with 'perf stat --repeat 10 ./static_orig' (and likewise './static_new').
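For reference, a minimal user-space driver matching the loop above could look like the sketch below. The buffer layout, the repeat count, and the do_memcpy() wrapper are assumptions on my part, since the actual harness behind these numbers is not included in this mail; presumably './static_orig' and './static_new' differ only in which memcpy implementation they link against.

/* Minimal sketch of the benchmark driver described above (assumed
 * layout): do_memcpy(dst_off, src_off, len) copies len bytes between
 * two static buffers at the given byte offsets.
 */
#include <string.h>

#define BUF_SIZE	(16 * 1024)
#define REPEAT		100000	/* enough work for perf stat to measure */

static char to[BUF_SIZE];
static char from[BUF_SIZE];

static void do_memcpy(size_t dst_off, size_t src_off, size_t len)
{
	memcpy(to + dst_off, from + src_off, len);
}

int main(void)
{
	size_t i;
	int rep;

	for (rep = 0; rep < REPEAT; rep++)
		for (i = 1024; i < 1024 * 16; i += 64)
			do_memcpy(0, 0, i);
	return 0;
}

Built statically (for example 'gcc -static -O2 bench.c -o static_new'), this is the kind of binary the perf command above would be pointed at.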
Before the patch:
Performance counter stats for './static_orig' (10 runs):

    3323.041832  task-clock-msecs         #      0.998 CPUs  ( +-   0.016% )
             22  context-switches         #      0.000 M/sec ( +-  31.913% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.001 M/sec ( +-   0.003% )
     9921549804  cycles                   #   2985.683 M/sec ( +-   0.016% )
    10863809359  instructions             #      1.095 IPC   ( +-   0.000% )
      972283451  cache-references         #    292.588 M/sec ( +-   0.018% )
          17703  cache-misses             #      0.005 M/sec ( +-   4.304% )

    3.330714469  seconds time elapsed   ( +-   0.021% )
After the patch:
Performance counter stats for './static_new' (10 runs):
    3392.902871  task-clock-msecs         #      0.998 CPUs ( +-   0.226% )
             21  context-switches         #      0.000 M/sec ( +-  30.982% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.001 M/sec ( +-   0.003% )
    10130188030  cycles                   #   2985.699 M/sec ( +-   0.227% )
      391981414  instructions             #      0.039 IPC   ( +-   0.013% )
      874161826  cache-references         #    257.644 M/sec ( +-   3.034% )
          17628  cache-misses             #      0.005 M/sec ( +-   4.577% )

    3.400681174  seconds time elapsed   ( +-   0.219% )

2. Results on Sandy Bridge
  Speedup on Sandy Bridge
  Len        Alignment             Speedup
  1024,       0/ 0:                1.08x 
  2048,       0/ 0:                1.42x 
  3072,       0/ 0:                1.51x 
  4096,       0/ 0:                1.63x 
  5120,       0/ 0:                1.67x 
  6144,       0/ 0:                1.72x 
  7168,       0/ 0:                1.75x 
  8192,       0/ 0:                1.77x 
  9216,       0/ 0:                1.80x 
  10240,      0/ 0:                1.80x 
  11264,      0/ 0:                1.82x 
  12288,      0/ 0:                1.85x 
  13312,      0/ 0:                1.85x 
  14336,      0/ 0:                1.88x 
  15360,      0/ 0:                1.88x 
                                  
The application was run through perf with the same loop as above:
for (i = 1024; i < 1024 * 16; i += 64)
	do_memcpy(0, 0, i);
Each binary was again run with 'perf stat --repeat 10 ./static_orig' (and likewise './static_new').
Before the patch:
Performance counter stats for './static_orig' (10 runs):

    3787.441240  task-clock-msecs         #      0.995 CPUs  ( +-   0.140% )
              8  context-switches         #      0.000 M/sec ( +-  22.602% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.001 M/sec ( +-   0.003% )
     6053487926  cycles                   #   1598.305 M/sec ( +-   0.140% )
    10861025194  instructions             #      1.794 IPC   ( +-   0.001% )
        2823963  cache-references         #      0.746 M/sec ( +-  69.345% )
         266000  cache-misses             #      0.070 M/sec ( +-   0.980% )

    3.805400837  seconds time elapsed   ( +-   0.139% )
After the patch:
Performance counter stats for './static_new' (10 runs):

    2879.424879  task-clock-msecs         #      0.995 CPUs  ( +-   0.076% )
             10  context-switches         #      0.000 M/sec ( +-  24.761% )
              0  CPU-migrations           #      0.000 M/sec ( +-     nan% )
           4428  page-faults              #      0.002 M/sec ( +-   0.003% )
     4602155158  cycles                   #   1598.290 M/sec ( +-   0.076% )
      386146993  instructions             #      0.084 IPC   ( +-   0.005% )
         520008  cache-references         #      0.181 M/sec ( +-   8.077% )
         267345  cache-misses             #      0.093 M/sec ( +-   0.792% )

    2.893813235  seconds time elapsed   ( +-   0.085% )

Thanks
Ling

>-----Original Message-----
>From: H. Peter Anvin [mailto:hpa@...or.com]
>Sent: November 7, 2009 3:26
>To: Ma, Ling
>Cc: mingo@...e.hu; tglx@...utronix.de; linux-kernel@...r.kernel.org
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/06/2009 09:07 AM, H. Peter Anvin wrote:
>>
>> Where did the 1024 byte threshold come from?  It seems a bit high to me,
>> and is at the very best a CPU-specific tuning factor.
>>
>> Andi is of course correct that older CPUs might suffer (sadly enough),
>> which is why we'd at the very least need some idea of what the
>> performance impact on those older CPUs would look like -- at that point
>> we can make a decision to just unconditionally do the rep movs or
>> consider some system where we point at different implementations for
>> different processors -- memcpy is probably one of the very few
>> operations for which something like that would make sense.
>>
>
>To be explicit: Ling, would you be willing to run some benchmarks across
>processors to see how this performs on non-Nehalem CPUs?
>
>	-hpa
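On the per-processor dispatch idea quoted above, purely as an illustration of the concept: a user-space version could resolve a function pointer once from a CPUID check, as sketched below. Every name in the sketch is a placeholder of mine, and the ERMS feature bit it tests only appeared in CPUs newer than the ones benchmarked here; the kernel itself, as I understand it, already selects the memcpy variant by patching the code through its alternatives mechanism rather than by an indirect call.

/* Illustration only: per-CPU selection of a memcpy variant via a
 * function pointer resolved at startup.  The two variants and the
 * feature check are placeholders, not kernel code.
 */
#include <cpuid.h>
#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn)(void *dst, const void *src, size_t len);

/* Stand-in for the existing unrolled 64-byte-per-loop copy. */
static void *memcpy_unrolled(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

/* Fast-string copy: a single rep movsb. */
static void *memcpy_rep_movsb(void *dst, const void *src, size_t len)
{
	void *ret = dst;

	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     :
		     : "memory");
	return ret;
}

/* CPUID.(EAX=7,ECX=0):EBX bit 9 advertises enhanced REP MOVSB/STOSB. */
static int cpu_has_erms(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	return !!(ebx & (1u << 9));
}

static memcpy_fn resolved_memcpy;

static void memcpy_resolve(void)
{
	resolved_memcpy = cpu_has_erms() ? memcpy_rep_movsb
					 : memcpy_unrolled;
}

Callers would then go through resolved_memcpy(dst, src, len) after memcpy_resolve() has run once at startup.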
