Message-ID: <CAOLP8p5X40nbRY56Ty4QcLR_5CyY2sTX=PwqD4vkdORKz9+HBQ@mail.gmail.com>
Date: Wed, 19 Mar 2014 21:12:57 -0400
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: Supporting AVX2/SSE2 or not with a single binary
Here are some speed results for hashing 2 GiB with and without SSE2 on
my Ivy Bridge Core i7, and how the lanes parameter affects runtime.
Normal SSE2 build, 8 lanes:
twocats> !
twocats> !time
time twocats -m21 -l8 -M0
hash:blake2s memCost:21 timeCost:0 multiplies:0 lanes:8 parallelism:2
password:password salt:salt blockSize:16384 subBlockSize:64
53 dc 33 f3 b4 86 1e fb
96 1c c1 93 5b 7d 79 2a
df 06 86 40 39 08 27 e7
e6 76 1b 5a e4 2d a0 2f 32 (octets)
real 0m0.315s
user 0m0.490s
sys 0m0.130s
i686 build, no SSE at all, 8 lanes:
time twocats -m21 -l8 -M0
hash:blake2s memCost:21 timeCost:0 multiplies:0 lanes:8 parallelism:2
password:password salt:salt blockSize:16384 subBlockSize:64
Rolled 0
53 dc 33 f3 b4 86 1e fb
96 1c c1 93 5b 7d 79 2a
df 06 86 40 39 08 27 e7
e6 76 1b 5a e4 2d a0 2f 32 (octets)
real 0m1.296s
user 0m2.340s
sys 0m0.100s
That's a heck of a penalty, about 4X slower in wall-clock time! So, if
you have no SIMD unit, you can gain back some, but not all, of the
speed by setting the parallel lanes to 1:
time twocats -m21 -l1 -M0
hash:blake2s memCost:21 timeCost:0 multiplies:0 lanes:1 parallelism:2
password:password salt:salt blockSize:16384 subBlockSize:64
c4 4b d0 df 16 ed df 10
0a 4d ff 27 31 fa 58 07
5b 98 ca 12 3e 19 fe 1a
16 bd 2c 33 54 55 90 b3 32 (octets)
real 0m0.895s
user 0m1.670s
sys 0m0.080s
It's still much worse than the 0.32 seconds I get with SSE2 enabled,
but if I use 64-bit mode and -march=native, it does a bit better with
one lane and no SSE:
time twocats -m21 -l1 -M0
hash:blake2s memCost:21 timeCost:0 multiplies:0 lanes:1 parallelism:2
password:password salt:salt blockSize:16384 subBlockSize:64
c4 4b d0 df 16 ed df 10
0a 4d ff 27 31 fa 58 07
5b 98 ca 12 3e 19 fe 1a
16 bd 2c 33 54 55 90 b3 32 (octets)
real 0m0.777s
user 0m1.340s
sys 0m0.200s
That's about 2.5X slower than what I get with SSE (0.777s vs. 0.315s).
I think I could get that to run a bit faster with some more
optimization, but SSE is very useful for filling memory bandwidth, if
that is the goal.
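
For anyone who wants to compare the runs side by side, here is a small sketch that turns the "real" times above into slowdown factors and a rough fill rate. It assumes the full 2 GiB is written once per hash, so the GiB/s figure is only a lower bound on actual memory traffic (reads are not counted). The labels and helper names are made up for illustration and are not part of twocats.

```python
# Wall-clock ("real") times in seconds, copied from the runs in this post.
GIB_HASHED = 2.0  # the post states 2 GiB is hashed (memCost:21)

runs = {
    "SSE2, 8 lanes": 0.315,
    "i686 no SSE, 8 lanes": 1.296,
    "i686 no SSE, 1 lane": 0.895,
    "x86-64 no SSE, 1 lane": 0.777,
}


def slowdown(seconds, baseline):
    """How many times slower a run is than the baseline run."""
    return seconds / baseline


def fill_rate_gib_per_s(seconds, gib=GIB_HASHED):
    """Rough fill rate, assuming each byte of memory is written once."""
    return gib / seconds


baseline = runs["SSE2, 8 lanes"]
for name, secs in runs.items():
    print(f"{name:24s} {secs:6.3f}s "
          f"{slowdown(secs, baseline):5.2f}x "
          f"{fill_rate_gib_per_s(secs):5.2f} GiB/s")
```

These are single runs on one machine, so treat the ratios as ballpark numbers rather than careful benchmarks.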
Bill