| lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
|
Open Source and information security mailing list archives
| ||
|
Message-ID: <CAOLP8p5+3Uj5AVinYhs8z99uFtQ3Z7gJKNF5QONcbEutgPmBnA@mail.gmail.com> Date: Sun, 23 Feb 2014 15:00:36 -0500 From: Bill Cox <waywardgeek@...il.com> To: discussions@...sword-hashing.net Subject: Initial AVX2 benchmarking Solar Designer has generously allowed me to do a bit of development and benchmarking on his Haswell machine, which supports Intel's latest and greatest vector processing extensions, AVX2. This expands the vector instructions from 4 32-bit words in parallel to 8 32-bit words, at least in the way I'm using it. Here's the benchmark that blows my mind. This, tigerkdf> time ./tigerkdf -m256 -M0 -r$((1024*1024)) -t12 -b4096 -B0 garlic:0 memorySize:256 multipliesPerKB:0 repetitions:1048576 numThreads:12 blockSize:4096 subBlockSize:0 Password:password Salt:salt 01 0d 55 38 e3 ef ca d8 57 9a dd 92 53 1b 2a 99 cc 30 cc ed 77 87 ea 74 a7 16 bb 66 9e e7 74 1d 32 (octets) real 0m0.602s user 0m4.464s sys 0m0.008s This is 12 threads hammering L1 cache in a hashing loop over and over. It's reading 0.25TiB twice, or 0.5TiB total, in .602 seconds, with a total bandwidth of 830 GiB/s to all the L1 caches in parallel. With SSE 128-bit instructions, the same code takes 1.084 seconds. Unfortunately, when hashing 2GiB of external DRAM, there's not any significant improvement in my code using AVX2 vs SSE. It seems I'm memory bandwidth limited either way. Bill
Powered by blists - more mailing lists