Date:	Mon, 18 Oct 2010 15:42:55 +0800
From:	Miao Xie <miaox@...fujitsu.com>
To:	"Ma, Ling" <ling.ma@...el.com>
CC:	"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
	Andi Kleen <andi@...stfloor.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	"Zhao, Yakui" <yakui.zhao@...el.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH V2 -tip] lib,x86_64: improve the performance of memcpy()
 for unaligned copy

On Mon, 18 Oct 2010 14:43:32 +0800, Ma, Ling wrote:
> "wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm"
>
> rep_good causes memcpy to jump to memcpy_c, so this patch is not exercised;
> we may continue to do further optimization on it later.

Yes, but in fact memcpy_c performs no better on some micro-architectures (such as
Wolfdale-3M), especially in the unaligned cases, so we need to optimize it as well.
I think the first step is to optimize the original memcpy() code.

> BTW, the improvement comes only from the Core2 shift-register optimization,
> but on most earlier CPUs shifts are very sensitive because of the decode stage.
> I have tested Atom, Opteron, and Nocona; the new patch is still better.

I think we can add a feature flag, just like X86_FEATURE_REP_GOOD, so that this
improvement applies only to Core2 and similar CPUs.

Regards
Miao

>
> Thanks
> Ling
>
> -----Original Message-----
> From: Miao Xie [mailto:miaox@...fujitsu.com]
> Sent: Monday, October 18, 2010 2:35 PM
> To: Ma, Ling
> Cc: H. Peter Anvin; Ingo Molnar; Andi Kleen; Thomas Gleixner; Zhao, Yakui; Linux Kernel
> Subject: Re: [PATCH V2 -tip] lib,x86_64: improve the performance of memcpy() for unaligned copy
>
> On Mon, 18 Oct 2010 14:27:40 +0800, Ma, Ling wrote:
>> Could please send out cpu info for this cpu model.
>
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 23
> model name	: Intel(R) Core(TM)2 Duo CPU     E7300  @ 2.66GHz
> stepping	: 6
> cpu MHz		: 1603.000
> cache size	: 3072 KB
> physical id	: 0
> siblings	: 2
> core id		: 0
> cpu cores	: 2
> apicid		: 0
> initial apicid	: 0
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 10
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm
> bogomips	: 5319.99
> clflush size	: 64
> cache_alignment	: 64
> address sizes	: 36 bits physical, 48 bits virtual
> power management:
>
> Thanks
> Miao
>
>>
>> Thanks
>> Ling
>>
>> -----Original Message-----
>> From: Miao Xie [mailto:miaox@...fujitsu.com]
>> Sent: Monday, October 18, 2010 2:24 PM
>> To: Ma, Ling
>> Cc: H. Peter Anvin; Ingo Molnar; Andi Kleen; Thomas Gleixner; Zhao, Yakui; Linux Kernel
>> Subject: Re: [PATCH V2 -tip] lib,x86_64: improve the performance of memcpy() for unaligned copy
>>
>> On Fri, 15 Oct 2010 03:43:53 +0800, Ma, Ling wrote:
>>> Attachment includes memcpy-kernel.c (cc -O2 memcpy-kernel.c -o
>>> memcpy-kernel), and unaligned test cases on Atom.
>>
>> I have tested on my Core2 Duo machine with your benchmark tool; the attachment is the test result. But the result differs from yours on Atom: here the performance is better with this patch.
>>
>> Given these two different results, maybe we need to optimize memcpy() per CPU model.
>>
>> Thanks
>> Miao
>>
>>>
>>> Thanks
>>> Ling
>>>
>>> -----Original Message-----
>>> From: Ma, Ling
>>> Sent: Thursday, October 14, 2010 9:14 AM
>>> To: 'H. Peter Anvin'; miaox@...fujitsu.com
>>> Cc: Ingo Molnar; Andi Kleen; Thomas Gleixner; Zhao, Yakui; Linux
>>> Kernel
>>> Subject: RE: [PATCH V2 -tip] lib,x86_64: improve the performance of
>>> memcpy() for unaligned copy
>>>
>>> Sure, I will post the benchmark tool and the Atom 64-bit results soon.
>>>
>>> Thanks
>>> Ling
>>>
>>> -----Original Message-----
>>> From: H. Peter Anvin [mailto:hpa@...or.com]
>>> Sent: Thursday, October 14, 2010 5:32 AM
>>> To: miaox@...fujitsu.com
>>> Cc: Ma, Ling; Ingo Molnar; Andi Kleen; Thomas Gleixner; Zhao, Yakui;
>>> Linux Kernel
>>> Subject: Re: [PATCH V2 -tip] lib,x86_64: improve the performance of
>>> memcpy() for unaligned copy
>>>
>>> On 10/08/2010 02:02 AM, Miao Xie wrote:
>>>> On Fri, 8 Oct 2010 15:42:45 +0800, Ma, Ling wrote:
>>>>> Could you please send the full address for each comparison result? We will do some tests on my machine.
>>>>> For unaligned cases, older CPUs slow down on loads and stores that cross a cache line, but for NHM there is no need to worry about it.
>>>>> By the way, in the kernel's 64-bit mode, our accesses should be roughly 8-byte aligned.
>>>>
>>>> Do you need my benchmark tool? I think it would be helpful for your tests.
>>>>
>>>
>>> If you could post the benchmark tool that would be great.
>>>
>>> 	-hpa
>>
>>
>>
>
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
