phc-discussions - Re: Supporting AVX2/SSE2 or not with a single binary


Message-ID: <CAOLP8p5X40nbRY56Ty4QcLR_5CyY2sTX=PwqD4vkdORKz9+HBQ@mail.gmail.com>
Date: Wed, 19 Mar 2014 21:12:57 -0400
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: Supporting AVX2/SSE2 or not with a single binary

Here are some speed results for hashing 2 GiB with and without SSE2 on
my Ivy Bridge Core i7, and how the lanes parameter affects runtime.

Normal SSE2 build, 8 lanes:

twocats> !
twocats> !time
time twocats -m21 -l8 -M0
hash:blake2s memCost:21 timeCost:0 multiplies:0 lanes:8 parallelism:2
password:password salt:salt blockSize:16384 subBlockSize:64

53 dc 33 f3 b4 86 1e fb
96 1c c1 93 5b 7d 79 2a
df 06 86 40 39 08 27 e7
e6 76 1b 5a e4 2d a0 2f      32 (octets)


real    0m0.315s
user    0m0.490s
sys     0m0.130s


i686 build, no SSE at all, 8 lanes:

time twocats -m21 -l8 -M0
hash:blake2s memCost:21 timeCost:0 multiplies:0 lanes:8 parallelism:2
password:password salt:salt blockSize:16384 subBlockSize:64
Rolled 0

53 dc 33 f3 b4 86 1e fb
96 1c c1 93 5b 7d 79 2a
df 06 86 40 39 08 27 e7
e6 76 1b 5a e4 2d a0 2f      32 (octets)


real    0m1.296s
user    0m2.340s
sys     0m0.100s


That's a heck of a penalty, roughly 4X in wall-clock time!  So, if you
have no SIMD unit, you can gain back some, but not all, of the speed by
setting the parallel lanes to 1:

time twocats -m21 -l1 -M0
hash:blake2s memCost:21 timeCost:0 multiplies:0 lanes:1 parallelism:2
password:password salt:salt blockSize:16384 subBlockSize:64

c4 4b d0 df 16 ed df 10
0a 4d ff 27 31 fa 58 07
5b 98 ca 12 3e 19 fe 1a
16 bd 2c 33 54 55 90 b3      32 (octets)


real    0m0.895s
user    0m1.670s
sys     0m0.080s
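
To put numbers on the three runs so far, here is a small Python sketch
that divides each "real" time by the SSE2 baseline. It is not part of
twocats; the `slowdown` helper and the run labels are made up for
illustration, and the times are simply copied from the output above.

```python
# Wall-clock ("real") times in seconds, copied from the runs above.
# The labels are illustrative, not twocats terminology.
real_seconds = {
    "SSE2 build, 8 lanes": 0.315,
    "i686 no SSE, 8 lanes": 1.296,
    "i686 no SSE, 1 lane": 0.895,
}


def slowdown(seconds, baseline):
    """Return how many times slower a run is than the baseline run."""
    return seconds / baseline


baseline = real_seconds["SSE2 build, 8 lanes"]
for label, seconds in real_seconds.items():
    print(f"{label:22s} {seconds:.3f}s  {slowdown(seconds, baseline):.2f}x")
```

On these single-run timings, the non-SSE build is about 4.1X slower
than the SSE2 build with 8 lanes and about 2.8X slower with 1 lane.
One run per configuration is a rough measurement, so treat the ratios
as approximate.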


It's still much worse than the 0.32 seconds I get with SSE2 enabled,
but if I use 64-bit mode and -march=native, it does a bit better with
one lane and no SSE:

time twocats -m21 -l1 -M0
hash:blake2s memCost:21 timeCost:0 multiplies:0 lanes:1 parallelism:2
password:password salt:salt blockSize:16384 subBlockSize:64

c4 4b d0 df 16 ed df 10
0a 4d ff 27 31 fa 58 07
5b 98 ca 12 3e 19 fe 1a
16 bd 2c 33 54 55 90 b3      32 (octets)


real    0m0.777s
user    0m1.340s
sys     0m0.200s


That's still about 2.5X worse than what I get with SSE (0.78 vs. 0.32
seconds).  I think I could get that to run a bit faster with some more
optimization, but SSE is very useful for filling memory bandwidth, if
that is the goal.
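
Since the goal mentioned here is filling memory bandwidth, the same
timings can be turned into a rough fill rate. The sketch below is only
an illustration: it takes the 2 GiB figure from the top of this message
at face value and divides it by each run's "real" time. The helper name
`fill_rate_gib_per_s` is hypothetical, and the result is memory hashed
per wall-clock second, not measured bus traffic (the algorithm both
writes and reads the memory, so actual traffic is higher).

```python
# Amount of memory hashed per run, as stated at the top of the message.
HASHED_GIB = 2.0

# Wall-clock ("real") times in seconds, copied from the four runs above.
# The labels are illustrative, not twocats terminology.
real_seconds = {
    "SSE2 build, 8 lanes": 0.315,
    "i686 no SSE, 8 lanes": 1.296,
    "i686 no SSE, 1 lane": 0.895,
    "64-bit no SSE, 1 lane": 0.777,
}


def fill_rate_gib_per_s(seconds, hashed_gib=HASHED_GIB):
    """Return GiB of memory hashed per second of wall-clock time."""
    return hashed_gib / seconds


for label, seconds in real_seconds.items():
    print(f"{label:22s} {fill_rate_gib_per_s(seconds):.2f} GiB/s")
```

By this estimate the SSE2 build fills memory at a bit over 6 GiB/s,
while the best non-SSE run manages roughly 2.6 GiB/s.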

Bill