linux-kernel - RE: [PATCH RFC] [X86] performance improvement for memcpy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Thu, 12 Nov 2009 10:12:14 +0800
From:	"Ma, Ling" <ling.ma@...el.com>
To:	"H. Peter Anvin" <hpa@...or.com>
CC:	Ingo Molnar <mingo@...e.hu>, Ingo Molnar <mingo@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	linux-kernel <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH RFC] [X86] performance improvement for memcpy_64.S by
 fast string.

>-----Original Message-----
>From: H. Peter Anvin [mailto:hpa@...or.com]
>Sent: 2009年11月12日 7:21
>To: Ma, Ling
>Cc: Ingo Molnar; Ingo Molnar; Thomas Gleixner; linux-kernel
>Subject: Re: [PATCH RFC] [X86] performance improvement for memcpy_64.S by fast
>string.
>
>On 11/10/2009 11:57 PM, Ma, Ling wrote:
>> Hi Ingo
>>
>> This program is for 64bit version, so please use 'cc -o memcpy  memcpy.c -O2
>-m64'
>>
>
>I did some measurements with this program; I added power-of-two
>measurements from 1-512 bytes, plus some different alignments, and found
>some very interesting results:
>
>Nehalem:
>	memcpy_new is a win for 1024+ bytes, but *also* a win for 2-32
>	bytes, where the old code apparently performs appallingly bad.
>
>	memcpy_new loses in the 64-512 byte range, so the 1024
>	threshold is probably justified.
>
>Core2:
>	memcpy_new is a win for <= 512 bytes, but a lose for larger
>	copies (possibly a win again for 16K+ copies, but those are
>	very rare in the Linux kernel.)  Surprise...
>
>	However, the difference is very small.
>
>However, I had overlooked something much more fundamental about your
>patch.  On Nehalem, at least *it will never get executed* (except during
>very early startup), because we replace the memcpy code with a jmp to
>memcpy_c on any CPU which has X86_FEATURE_REP_GOOD, which includes Nehalem.
>
>So the patch is a no-op on Nehalem, and any other modern CPU.

[Ma Ling]
It is good for modern CPU, our original intention is also to introduce movsq for Nehalem, above method is more smart.

>Am I guessing that the perf numbers you posted originally were all from
>your user space test program?

[Ma Ling] 
Yes, they are all from this program, and I'm confused about measurement values will be different for only one case after multiple tests.
(3 times at least on my core2 platform).

Thanks
Ling