linux-kernel - RE: [PATCH V2 -tip] lib,x86_64: improve the performance of memcpy() for unaligned copy


Message-ID: <C10D3FB0CD45994C8A51FEC1227CE22F15D77726AF@shsmsx502.ccr.corp.intel.com>
Date:	Tue, 19 Oct 2010 12:06:23 +0800
From:	"Ma, Ling" <ling.ma@...el.com>
To:	"miaox@...fujitsu.com" <miaox@...fujitsu.com>
CC:	"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
	Andi Kleen <andi@...stfloor.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	"Zhao, Yakui" <yakui.zhao@...el.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH V2 -tip] lib,x86_64: improve the performance of memcpy()
 for unaligned copy



On Mon, 18 Oct 2010 16:01:13 +0800, Ma, Ling wrote:
>>>> rep_good will cause memcpy to jump to memcpy_c, so this patch will not run;
>>> we may continue to optimize it further later.
>>
>>> Yes, but in fact the performance of memcpy_c is not better on some micro-architectures (such as
>>> Wolfdale-3M), especially in the unaligned cases, so we need to optimize it, and I think
>>> the first step is to optimize the original code of memcpy().
>>
>> As mentioned above, we will optimize memcpy_c further soon.
>> Two reasons:
>>    1. The movs instruction has a long startup latency.
>>    2. The movs instruction is not good for the unaligned case.
>>
>>>> BTW, the improvement comes only from the Core2 shift-register optimization,
>>>> but on most earlier CPUs shift-register instructions are very sensitive because of the decode stage.
>>>> I have tested Atom, Opteron, and Nocona; the new patch is still better.
>>
>>> I think we can add a flag to make this improvement only valid for Core2 or other CPU like it,
>>> just like X86_FEATURE_REP_GOOD.
>>
>> I think we should optimize for Core2 in the memcpy_c function in the future.

>But there is a problem: if we use alternatives, the length of the new instruction must be less than or
>equal to the length of the original instruction, and it seems the length of the Core2-optimized
>instructions may be greater than that of the original. So I think we can't optimize for Core2
>in the memcpy_c function, only in the memcpy function.
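
To make the length constraint quoted above concrete, here is a minimal sketch in C. It is a toy model only, not the kernel's actual alternatives code: patch_site() and NOP_BYTE are hypothetical names invented for illustration, and padding with single-byte NOPs is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOP_BYTE 0x90  /* single-byte x86 NOP, used here as simple padding */

/*
 * Toy model of the length rule (hypothetical helper, not a kernel API):
 * overwrite the orig_len bytes at site with repl and pad the tail with NOPs.
 * Returns 0 on success, or -1 if the replacement does not fit.
 */
int patch_site(uint8_t *site, size_t orig_len,
               const uint8_t *repl, size_t repl_len)
{
    assert(site != NULL && repl != NULL);

    /* A longer replacement would overwrite the code that follows the site. */
    if (repl_len > orig_len)
        return -1;

    memcpy(site, repl, repl_len);
    memset(site + repl_len, NOP_BYTE, orig_len - repl_len);
    return 0;
}
```

In this model a 2-byte replacement patched into a 4-byte site leaves two NOP bytes behind it, while a 5-byte replacement is rejected and the site is left untouched.
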
We keep the above rule because we worry about i-cache capacity misses and their impact on overall performance.
However, given modern CPU architecture, we have several questions about it:
1. The current Linux kernel is far larger than previous versions and than the i-cache size (32k).
2. Hardware prefetch prediction has become more important and sophisticated; even while we access the current cache line,
   the hardware prefetcher will fetch the next line or lines on Intel and AMD platforms.
3. Based on our tests, we do not find that compiling with -Os (for size) is better overall than -O2 (for performance)
   on modern CPUs, for workloads such as specjbb2005/2000, volano, kbuild ...
4. We have found that memcpy_c has a performance problem; we should try to resolve it in as small a size as possible.
   It is strange to separate Core2 from other CPUs by adding a new flag,
   and I think your patch must be bigger than the last version.
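
For anyone who wants to check the aligned versus unaligned behaviour on their own machine, a rough user-space sketch follows. It is an assumption-laden illustration, not the benchmark used in this thread: time_copy(), align_up() and the 64-byte CACHE_LINE value are hypothetical, it times the C library's memcpy() rather than the kernel's, and the compiler barrier uses GCC/Clang inline assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CACHE_LINE 64  /* assumed cache-line size */

/* Round a pointer up to the next CACHE_LINE boundary. */
static unsigned char *align_up(unsigned char *p)
{
    uintptr_t v = (uintptr_t)p;
    return (unsigned char *)((v + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1));
}

/*
 * Hypothetical helper: copy len bytes iters times, with source and
 * destination starting src_off/dst_off bytes past a cache-line boundary.
 * Returns CPU seconds spent, or -1.0 on allocation failure or a bad copy.
 */
double time_copy(size_t src_off, size_t dst_off, size_t len, int iters)
{
    unsigned char *src_raw = malloc(len + src_off + CACHE_LINE);
    unsigned char *dst_raw = malloc(len + dst_off + CACHE_LINE);
    double secs = -1.0;

    assert(iters >= 0);

    if (src_raw && dst_raw) {
        unsigned char *src = align_up(src_raw) + src_off;
        unsigned char *dst = align_up(dst_raw) + dst_off;
        clock_t start;
        size_t i;

        /* Fill the source with a recognizable pattern, clear the destination. */
        for (i = 0; i < len; i++)
            src[i] = (unsigned char)(i * 31u + 7u);
        memset(dst, 0, len);

        start = clock();
        for (int n = 0; n < iters; n++) {
            memcpy(dst, src, len);
            /* Compiler barrier (GCC/Clang) so the copy is not optimized away. */
            __asm__ __volatile__("" : : "r"(dst) : "memory");
        }
        secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        /* Verify the last copy before trusting the timing. */
        if (memcmp(dst, src, len) != 0)
            secs = -1.0;
    }
    free(src_raw);
    free(dst_raw);
    return secs;
}
```

Comparing time_copy(0, 0, len, iters) with, say, time_copy(1, 3, len, iters) for a range of lengths gives a first impression of how sensitive a given CPU is to misalignment; the numbers depend heavily on the machine and the C library.
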

Thanks
Ling

