lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date: Thu, 30 Apr 2015 20:38:52 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] yescrypt AVX2

On Thu, Apr 30, 2015 at 05:32:48PM +0100, Samuel Neves wrote:
> On 04/30/2015 03:14 PM, Solar Designer wrote:
> > #define LOAD2X128(hi, lo) ({ \
> > 	register __m256i out asm("ymm0"); \
> > 	__asm__("\n\tvbroadcasti128 %1,%%ymm0" \
> > 		"\n\tmovdqa %2,%%xmm0" \
> > 		: "=xm" (out) \
> > 		: "xm" (hi), "xm" (lo)); \
> > 	out; \
> > })
> >
> > I got ridiculous speeds - like 30 times lower.
> 
> Is it me, or are you mixing VEX-encoded instructions with regular ones in the above snippet? That is bound to generate a
> large number of stalls transitioning between YMM register states.

Oh, indeed.  This kills my idea since vmovdqa zeroes the upper 128 bits.

I've just checked how icc does this.  They have _mm256_loadu2_m128i(),
and for me it got translated to:

        vmovups   (%rsi), %xmm0
        vinsertf128 $1, (%rdi), %ymm0, %ymm0

when building for AVX, and to:

        vmovdqu   (%rsi), %xmm0
        vinserti128 $1, (%rdi), %ymm0, %ymm0

when building for AVX2.

Intel's own Intrinsics Guide documents the intrinsics _mm256_insertf*()
as having 1 cycle latency on Sandy/Ivy Bridge, but 3 cycles on Haswell,
and _mm256_inserti128_si256() as having 3 cycles latency on Haswell.  So
apparently there's no better way.

Alexander

Powered by blists - more mailing lists