Message-ID: <20150430173852.GA26885@openwall.com>
Date: Thu, 30 Apr 2015 20:38:52 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] yescrypt AVX2
On Thu, Apr 30, 2015 at 05:32:48PM +0100, Samuel Neves wrote:
> On 04/30/2015 03:14 PM, Solar Designer wrote:
> > #define LOAD2X128(hi, lo) ({ \
> > register __m256i out asm("ymm0"); \
> > __asm__("\n\tvbroadcasti128 %1,%%ymm0" \
> > "\n\tmovdqa %2,%%xmm0" \
> > : "=xm" (out) \
> > : "xm" (hi), "xm" (lo)); \
> > out; \
> > })
> >
> > I got ridiculous speeds - like 30 times lower.
>
> Is it me, or are you mixing VEX-encoded instructions with regular ones in the above snippet? That is bound to generate a
> large number of stalls transitioning between YMM register states.
Oh, indeed. This kills my idea, since the VEX-encoded vmovdqa zeroes the
upper 128 bits of the destination register.
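For reference, an all-VEX-encoded variant of the macro would have to fill
the high half with vinserti128 rather than rely on the upper bits being
preserved. A minimal sketch, assuming AVX2, gcc-style inline asm, and that
hi and lo are __m128i values (the LOAD2X128_VEX name and the unaligned
load are my own choices here, not tested code from yescrypt):

#define LOAD2X128_VEX(hi, lo) ({ \
	__m256i out; \
	/* VEX-encoded 128-bit load zeroes bits 255:128 of out */ \
	__asm__("\n\tvmovdqu %1,%x0" \
		/* so the high lane has to be filled explicitly */ \
		"\n\tvinserti128 $1,%2,%0,%0" \
		: "=x" (out) \
		: "m" (lo), "m" (hi)); \
	out; \
})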
I've just checked how icc does this. They have _mm256_loadu2_m128i(),
and for me it got translated to:
vmovups (%rsi), %xmm0
vinsertf128 $1, (%rdi), %ymm0, %ymm0
when building for AVX, and to:
vmovdqu (%rsi), %xmm0
vinserti128 $1, (%rdi), %ymm0, %ymm0
when building for AVX2.
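On gcc and clang, where _mm256_loadu2_m128i() may not be available,
roughly the same sequence can be written with plain AVX2 intrinsics.
A minimal sketch mirroring the icc output above; the function name and
pointer parameters are mine, not from yescrypt:

#include <immintrin.h>

static inline __m256i loadu2_m128i(const __m128i *hiaddr,
    const __m128i *loaddr)
{
	/* VEX 128-bit load; the upper bits left by the cast are
	   don't-care, since the next insert overwrites them */
	__m256i lo256 = _mm256_castsi128_si256(_mm_loadu_si128(loaddr));
	/* fill the high 128-bit lane, i.e. the vinserti128 above */
	return _mm256_inserti128_si256(lo256, _mm_loadu_si128(hiaddr), 1);
}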
Intel's own Intrinsics Guide documents the _mm256_insertf*() intrinsics
as having 1 cycle of latency on Sandy/Ivy Bridge, but 3 cycles on Haswell,
and _mm256_inserti128_si256() as having 3 cycles of latency on Haswell.
So apparently there's no better way.
Alexander