Date: Fri, 3 Apr 2015 14:12:50 +0300
From: Solar Designer <>
Subject: Re: [PHC] Compute time hardness (pwxform,blake,blamka)

Hi Marcos,

Thank you for running these benchmarks.  They're very helpful.

On Thu, Apr 02, 2015 at 06:22:38PM -0300, Marcos Simplicio wrote:
> BTW, does pwxform provide full diffusion (i.e., one bit ends up being
> dependent on all other bits) in an unbiased manner? It may not be
> critical here, but it is usually a "good to have" requirement.

No, it does not - it provides no diffusion at all between the 64-bit
parallel lanes.  (Within the 64-bit lanes, it might.)

Full diffusion is provided at the end of each block, via Salsa20/8.
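For illustration, here's a simplified Python model of a pwxform-style round.  This is a sketch, not the actual yescrypt code: real pwxform derives the S-box indices from low bits of the lane value and runs several rounds, while here the indexing is reduced to a modulo just to show the lane structure.

```python
# Simplified model of one pwxform-style round over independent 64-bit lanes.
# Sketch only -- not the actual yescrypt code (S-box addressing simplified).

MASK64 = (1 << 64) - 1

def pwxform_round(lanes, S0, S1):
    """Transform each 64-bit lane independently: no cross-lane diffusion."""
    out = []
    for x in lanes:
        lo = x & 0xFFFFFFFF                   # low 32-bit half
        hi = x >> 32                          # high 32-bit half
        x = (lo * hi) & MASK64                # 32x32->64 MUL: latency hardening
        x = (x + S0[lo % len(S0)]) & MASK64   # ADD with an S0 entry
        x ^= S1[hi % len(S1)]                 # XOR with an S1 entry
        out.append(x)
    return out
```

Since each lane is processed on its own, flipping a bit in one lane can never affect another lane within the round; cross-lane mixing only happens when Salsa20/8 is applied at the end of the block.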

> - For 1 thread ("p1"), when both yescrypt and Lyra2 have the exact same
> number of passes through memory (Lyra2 with T=1 and yescrypt with T=2)
> and use an underlying hash with similar compute-time hardening (12 MULs,
> 12 ADDs and 12 XORs given by 1.5 BlaMka and pwxform), then the lines
> almost overlap; so I believe we are close to a draw.

This is nice to know.  It's kind of a sanity check for both designs.

There's a difference, though: I guess yescrypt's pwxform S-box lookups
are much more frequent than Lyra2's equivalent (is there one?)  If so,
yescrypt actually performs more attack-discouraging work in the same
amount of time here.  Can you compare Lyra2 and yescrypt in this respect?

Also, I never understood why you chose two full passes through memory as
the minimum for Lyra2.  Is this arbitrary?  For yescrypt, the 4/3 can be
shown as being the optimal stopping point assuming no AT-reducing TMTO,
so it's not arbitrary at all.

I currently recommend that yescrypt be used with t=0 unless, in a given
case, more time can be spent but more memory cannot be used.

> - For 1 thread ("p1") with different numbers of passes through memory:
> say, yescrypt with T=0 doing 4/3 passes through memory with 12
> MUL+ADD+XOR, while Lyra2 with T=1 and BlaMka makes 2 passes through
> memory with 8 MUL+ADD+XOR; these are comparable (?) in terms of
> compute-time latency (4/3*12 = 2*8 = 16). Lyra2 in this case is
> slightly (3%) slower. Given the low
> difference, the reason is probably related to bandwidth usage, because
> Lyra2 purposely uses more bandwidth in this scenario: namely, it makes 2
> writes per call of the underlying hash instead of 1, and 4 reads (2 from
> cache and 2 from memory) instead of 2. I'm not sure if that is good or
> bad, but improvements in memory bandwidth usually come more slowly than
> improvements in computational power, so I would rather say it is good
> in terms of attack resistance, as it limits an attacker's ability to
> exploit parallelism.
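The latency-matching arithmetic quoted above can be checked directly, using only the numbers from that paragraph:

```python
from fractions import Fraction

# Compute-latency product: (passes through memory) x (MULs per hash call),
# with the figures quoted in the message above.
yescrypt_t0 = Fraction(4, 3) * 12   # 4/3 passes, 12 MUL+ADD+XOR (pwxform)
lyra2_t1    = Fraction(2, 1) * 8    # 2 passes, 8 MUL+ADD+XOR (BlaMka)
assert yescrypt_t0 == lyra2_t1 == 16
```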

This makes sense, but you also need to factor in yescrypt's S-box lookups.

> - For 2 threads ("p2"), Lyra2 is faster for similar compute-time
> hardening and also for higher compute-time hardening. That (and a very
> old e-mail by Bill) is why we think bandwidth is likely to be at play
> for p=1: we reduce the number of writes to 1 per thread during the
> Wandering phase when more threads are active.
> - For 4 threads ("p4"), Lyra2 is again faster for similar compute-time
> hardening, and also faster for higher compute-time hardening.

OK.  I expect yescrypt's S-box lookups come into play here.  In fact,
they also affect the 1 thread case, so I'm lucky yescrypt didn't appear
to run slower there.

Would you also benchmark Lyra2 vs. yescrypt for the maximum number of
threads supported in hardware on your machine, please?

What machine is this, for these benchmarks?  Are both Lyra2 and yescrypt
built for the same SIMD intrinsics (and which)?

There's significant room for improvement for yescrypt with AVX2, which I
haven't added yet.  In fact, on pre-AVX2 CPUs the compute hardening can
be doubled by reducing the parallelism from 512-bit to 256-bit, with
only a moderate slowdown, so the hardening per defensive running time
would likely improve by 50% or so.  However, I preferred the defaults to
be balanced, with newer and near-future CPUs taken into consideration as
well, and also with CPU attacks considered (where an attacker could
bring in extra parallelism from having multiple candidate passwords to
test).
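Back-of-the-envelope, the "50% or so" figure follows if "moderate slowdown" means about 4/3x; that slowdown value is an assumption chosen to reproduce the figure, not a measurement:

```python
# 256-bit vs. 512-bit trade-off on pre-AVX2 CPUs, as described above.
# Assumption (not stated in the message): "moderate slowdown" ~= 4/3x.
hardening_gain = 2.0        # halving SIMD width doubles the serial MUL chain
slowdown = 4.0 / 3.0        # assumed cost in defensive running time
per_time_gain = hardening_gain / slowdown
assert abs(per_time_gain - 1.5) < 1e-9   # +50% hardening per defensive second
```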

Thanks again,

