Date: Sat, 4 Jan 2014 10:07:58 +0400
From: Solar Designer <>
Subject: Re: [PHC] Reworked KDF available on github for feedback: NOELKDF

On Sat, Jan 04, 2014 at 09:27:24AM +0400, Solar Designer wrote:
> On Fri, Jan 03, 2014 at 03:12:40PM -0500, Bill Cox wrote:
> > I gave it a name this time.  It's based on the experience so far from
> > keystretch, but is simpler, and even faster.  It doesn't use a "key" to
> > hash pages, but instead treats the previously generated "page" as the key
> > for generating the next one by hashing it with a random previous page.  On
> > my Corei7 server, it runs about 15% slower than memmove (which is faster
> > than memcpy because 1/2 the memory is allocated).  2GB of hashing takes
> > 0.35 seconds.  Both are multi-threaded 2-way, which is the fastest case on
> > both machines.
> This is very nice speed.

I don't recall if you had provided exact specs for your machines, but
assuming 2x DDR3-1600, the memory bandwidth is 25.6 GB/s (theoretical
peak).  0.35s for 2 GB writes and random reads gives 12 GB/s, so there's
still room for improvement.  (Indeed, you'd have to use SSE instructions
to achieve full memory bandwidth.  In my benchmarks of memory bandwidth
alone, I generally get 80% to 85% of theoretical peak by using an
unrolled loop with movdqa and running enough threads.)

> Salsa20 rounds reduced to 2:
> real    0m0.854s
> user    0m3.136s
> sys     0m2.536s
> Let's try to compute the hash 10 times per program invocation, to move
> the memory allocation overhead out of the loop (is only done once here):
> real    0m6.159s
> user    0m45.363s
> sys     0m2.584s
> Now it's down to ~0.6s per hash, which is better than 0.35s claimed for
> NOELKDF, given escrypt's 2x iteration count (and thus cost to attacker).

In escrypt running in this mode, SMix's first loop has sequential writes
and random reads, and its second loop has random reads only.  So it's 3*N
memory accesses (not counting accesses to data that is presumed to be in
L1 cache).  This gives ~10 GB/s, suggesting that there's also room for
improvement here.

It is unclear how much of that potential improvement can be achieved
without reducing the amount of other work (other than accessing
non-cached memory) that we do.  Would better instruction scheduling
achieve a lot?  That would mean inter-mixing accesses to cached and
non-cached data (something a compiler currently doesn't consider for
us), as well as interleaving memory+cache accesses with computation
(something a compiler should already be taking care of when building
with function inlining and loop unrolling enabled).  I doubt it, but it
may be worth trying (tricky).

BTW, these recent benchmarks I ran were using 4 KB pages, I think.
Moving to 2 MB pages may provide some speedup (for both KDFs), as I had
observed on my other benchmarks.

