Message-ID: <20150430173852.GA26885@openwall.com>
Date: Thu, 30 Apr 2015 20:38:52 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] yescrypt AVX2
On Thu, Apr 30, 2015 at 05:32:48PM +0100, Samuel Neves wrote:
> On 04/30/2015 03:14 PM, Solar Designer wrote:
> > #define LOAD2X128(hi, lo) ({ \
> > register __m256i out asm("ymm0"); \
> > __asm__("\n\tvbroadcasti128 %1,%%ymm0" \
> > "\n\tmovdqa %2,%%xmm0" \
> > : "=xm" (out) \
> > : "xm" (hi), "xm" (lo)); \
> > out; \
> > })
> >
> > I got ridiculous speeds - like 30 times lower.
>
> Is it me, or are you mixing VEX-encoded instructions with regular ones in the above snippet? That is bound to generate a
> large number of stalls transitioning between YMM register states.
Oh, indeed. This kills my idea, since the VEX-encoded vmovdqa zeroes the
upper 128 bits of the destination register.
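For reference, an all-VEX-encoded variant of the macro would have to fill
the high half with vinserti128 rather than rely on the upper bits being
preserved. A minimal sketch, assuming AVX2, gcc-style inline asm, and that
hi and lo are __m128i values (the LOAD2X128_VEX name and the unaligned
load are my own choices here, not tested code from yescrypt):

#define LOAD2X128_VEX(hi, lo) ({ \
	__m256i out; \
	/* VEX-encoded 128-bit load zeroes bits 255:128 of out */ \
	__asm__("\n\tvmovdqu %1,%x0" \
		/* so the high lane has to be filled explicitly */ \
		"\n\tvinserti128 $1,%2,%0,%0" \
		: "=x" (out) \
		: "m" (lo), "m" (hi)); \
	out; \
})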
I've just checked how icc does this. They have _mm256_loadu2_m128i(),
and for me it got translated to:
vmovups (%rsi), %xmm0
vinsertf128 $1, (%rdi), %ymm0, %ymm0
when building for AVX, and to:
vmovdqu (%rsi), %xmm0
vinserti128 $1, (%rdi), %ymm0, %ymm0
when building for AVX2.
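On gcc and clang, where _mm256_loadu2_m128i() may not be available,
roughly the same sequence can be written with plain AVX2 intrinsics.
A minimal sketch mirroring the icc output above; the function name and
pointer parameters are mine, not from yescrypt:

#include <immintrin.h>

static inline __m256i loadu2_m128i(const __m128i *hiaddr,
    const __m128i *loaddr)
{
	/* VEX 128-bit load; the upper bits left by the cast are
	   don't-care, since the next insert overwrites them */
	__m256i lo256 = _mm256_castsi128_si256(_mm_loadu_si128(loaddr));
	/* fill the high 128-bit lane, i.e. the vinserti128 above */
	return _mm256_inserti128_si256(lo256, _mm_loadu_si128(hiaddr), 1);
}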
Intel's own Intrinsics Guide documents the _mm256_insertf*() intrinsics
as having 1 cycle of latency on Sandy/Ivy Bridge, but 3 cycles on Haswell,
and _mm256_inserti128_si256() as having 3 cycles of latency on Haswell.
So apparently there's no better way.
Alexander