Message-ID: <CAOLP8p5+3Uj5AVinYhs8z99uFtQ3Z7gJKNF5QONcbEutgPmBnA@mail.gmail.com>
Date: Sun, 23 Feb 2014 15:00:36 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Initial AVX2 benchmarking

Solar Designer has generously allowed me to do a bit of development
and benchmarking on his Haswell machine, which supports Intel's latest
and greatest vector processing extensions, AVX2.  This widens the
vector operations from 4 32-bit words processed in parallel to 8
32-bit words, at least in the way I'm using them.  Here's the
benchmark that blows my mind.  This,

tigerkdf> time ./tigerkdf -m256 -M0 -r$((1024*1024)) -t12 -b4096 -B0
garlic:0 memorySize:256 multipliesPerKB:0 repetitions:1048576
numThreads:12 blockSize:4096 subBlockSize:0
Password:password Salt:salt

01 0d 55 38 e3 ef ca d8
57 9a dd 92 53 1b 2a 99
cc 30 cc ed 77 87 ea 74
a7 16 bb 66 9e e7 74 1d      32 (octets)


real    0m0.602s
user    0m4.464s
sys     0m0.008s

This is 12 threads hammering L1 cache in a hashing loop over and
over.  It's reading 0.25 TiB twice, or 0.5 TiB total, in 0.602
seconds, with a total bandwidth of 830 GiB/s to all the L1 caches in
parallel.  With SSE 128-bit instructions, the same code takes 1.084
seconds.
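
For anyone curious what the 4-lane vs 8-lane difference looks like in
code, here's a minimal, hypothetical sketch (not TigerKDF's actual
inner loop; the names are mine) of the same add-over-a-4-KiB-block
step done with SSE2 and with AVX2 intrinsics:

/* Hypothetical sketch, not the real TigerKDF code: sum a 4 KiB block
 * into a small state, 4 x 32-bit lanes with SSE2 vs 8 x 32-bit lanes
 * with AVX2.  Build with: gcc -O3 -mavx2 sketch.c */
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

#define BLOCK_WORDS (4096 / sizeof(uint32_t))   /* 1024 32-bit words */

/* SSE2: 4 x 32-bit adds per instruction */
static void mix_sse(uint32_t state[4], const uint32_t *block)
{
    __m128i s = _mm_loadu_si128((const __m128i *)state);
    for (size_t i = 0; i < BLOCK_WORDS; i += 4)
        s = _mm_add_epi32(s, _mm_loadu_si128((const __m128i *)(block + i)));
    _mm_storeu_si128((__m128i *)state, s);
}

/* AVX2: 8 x 32-bit adds per instruction, so half the loop iterations */
static void mix_avx2(uint32_t state[8], const uint32_t *block)
{
    __m256i s = _mm256_loadu_si256((const __m256i *)state);
    for (size_t i = 0; i < BLOCK_WORDS; i += 8)
        s = _mm256_add_epi32(s, _mm256_loadu_si256((const __m256i *)(block + i)));
    _mm256_storeu_si256((__m256i *)state, s);
}

As long as the block stays resident in L1, the AVX2 version retires
roughly half the vector instructions for the same work, which lines up
with the wall-clock time dropping from 1.084 to 0.602 seconds above.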

Unfortunately, when hashing 2 GiB of external DRAM, there's no
significant improvement in my code using AVX2 vs SSE.  It seems I'm
memory-bandwidth limited either way.
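
To make the "memory-bandwidth limited" claim concrete, a crude way to
see the DRAM ceiling is to time one streaming pass over a 2 GiB
buffer.  This is only an illustrative sketch, not part of tigerkdf:

/* Rough DRAM-bandwidth probe: stream once through 2 GiB and report
 * GiB/s.  If hashing throughput is already near this number, wider
 * vectors (AVX2 vs SSE) can't buy much. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
    size_t words = (size_t)2 * 1024 * 1024 * 1024 / sizeof(uint64_t);
    uint64_t *buf = malloc(words * sizeof(uint64_t));
    if (!buf)
        return 1;

    /* Touch every page first so the pages are really backed by DRAM. */
    for (size_t i = 0; i < words; i++)
        buf[i] = (uint64_t)i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t sum = 0;
    for (size_t i = 0; i < words; i++)
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("sum=%llu  %.1f GiB/s\n", (unsigned long long)sum, 2.0 / secs);
    free(buf);
    return 0;
}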

Bill
