phc-discussions - Re: [PHC] TigerKDF paper and code ready for review

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140308030349.GA29039@openwall.com>
Date: Sat, 8 Mar 2014 07:03:49 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] TigerKDF paper and code ready for review

On Fri, Mar 07, 2014 at 08:56:31PM -0500, Bill Cox wrote:
> On Fri, Mar 7, 2014 at 7:13 PM, Solar Designer <solar@...nwall.com> wrote:
> >> At least with AVX2 on Haswell, I would be surprised if Bcrypt's inner
> >> loop were faster, so for hashing out of just L1 cache, I'm probably
> >> good on that platform vs current GPUs.
> >
> > Are you trying to say that on Haswell you read 64 byte blocks as rapidly
> > as bcrypt reads 4 byte blocks (also on Haswell)?  I think this is false.
> 
> I need to verify this, but I think the math last time showed that a
> loop with 3 AVX2 instructions reading 2 32 byte values at a time,
> hashing them slightly, and writing them back ran at the clock speed.
> I should have said 32-bytes, not 64.

OK, you might be right.

For my optimized defensive implementation of bcrypt running on the
i7-4770K, I am getting 603 c/s at $2a$05 when running one thread (and
one instance), and 4330 c/s when running 8 threads.  (FWIW, attack
optimized implementations, with 2 or 3 instances per thread, are twice
faster when running one thread, or 1.5 times faster when running 8
threads.)

This gives:

603*34000*16*4/(3.9*10^9) = 0.336 lookups/cycle
4330*34000*16*4/(3.7*10^9) = 2.55 lookups/cycle

where 34000 is roughly the number of Blowfish encryptions performed,
16 is the number of Blowfish rounds, and 4 is the number of S-box
lookups per round.

(For attack, we'd have about 4 lookups/cycle on this CPU, meaning about
1 lookup/cycle/core.)

BTW, the defensive single-thread result above is pretty bad in how it
compares against historical Blowfish cycles per byte speeds.  It is:

3.9*10^9 / (34000*8*603) = 23.78 cpb

whereas the original Pentium did 18 cpb:

https://www.schneier.com/blowfish-speed.html

and my own assembly code for bcrypt on the original Pentium was inbetween:

120*10^6 / (34000*8*20.1) = 21.95 cpb

(This is for defensive, thread-safe code; attack and thread-unsafe code
was slightly faster, and is 1.5x to 2x faster on Haswell.)

Perhaps there's some room for further optimization/tuning for Haswell,
but overall these numbers should be about right.

Alexander