Message-ID: <20150430173852.GA26885@openwall.com>
Date: Thu, 30 Apr 2015 20:38:52 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] yescrypt AVX2

On Thu, Apr 30, 2015 at 05:32:48PM +0100, Samuel Neves wrote:
> On 04/30/2015 03:14 PM, Solar Designer wrote:
> > #define LOAD2X128(hi, lo) ({ \
> > 	register __m256i out asm("ymm0"); \
> > 	__asm__("\n\tvbroadcasti128 %1,%%ymm0" \
> > 	    "\n\tmovdqa %2,%%xmm0" \
> > 	    : "=xm" (out) \
> > 	    : "xm" (hi), "xm" (lo)); \
> > 	out; \
> > })
> >
> > I got ridiculous speeds - like 30 times lower.
>
> Is it me, or are you mixing VEX-encoded instructions with regular ones
> in the above snippet? That is bound to generate a large number of
> stalls transitioning between YMM register states.

Oh, indeed.  This kills my idea, since vmovdqa zeroes the upper 128 bits.

I've just checked how icc does this.  They have _mm256_loadu2_m128i(),
and for me it got translated to:

	vmovups (%rsi), %xmm0
	vinsertf128 $1, (%rdi), %ymm0, %ymm0

when building for AVX, and to:

	vmovdqu (%rsi), %xmm0
	vinserti128 $1, (%rdi), %ymm0, %ymm0

when building for AVX2.

Intel's own Intrinsics Guide documents the _mm256_insertf*() intrinsics
as having 1 cycle latency on Sandy/Ivy Bridge but 3 cycles on Haswell,
and _mm256_inserti128_si256() as having 3 cycles latency on Haswell.
So apparently there's no better way.

Alexander
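
For reference, a minimal sketch of an all-VEX alternative to the quoted
LOAD2X128() macro, written with intrinsics rather than inline asm.  This is
not from the thread: the function name load2x128() is made up here, it takes
pointers (like _mm256_loadu2_m128i()) rather than "xm" operands like the
original macro, and it assumes an AVX2 build for _mm256_inserti128_si256().
It mirrors the vmovdqu + vinserti128 sequence that icc emits:

	#include <immintrin.h>

	/* Hypothetical all-VEX replacement for LOAD2X128(): *hi goes into
	   the upper 128 bits, *lo into the lower 128 bits, using only
	   VEX-encoded instructions, so no VEX/legacy-SSE transition
	   stalls. */
	static inline __m256i load2x128(const __m128i *hi, const __m128i *lo)
	{
		/* VEX-encoded 128-bit load; zeroes the upper lane */
		__m256i r = _mm256_castsi128_si256(_mm_loadu_si128(lo));
		/* vinserti128 (AVX2): place *hi into the upper 128 bits */
		return _mm256_inserti128_si256(r, _mm_loadu_si128(hi), 1);
	}

On compilers that provide _mm256_loadu2_m128i(), the body could presumably
be just a call to that intrinsic; the expansion above is what one would
write where it is unavailable.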