Date: Fri, 18 Apr 2014 00:10:02 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Lyra2 initial review

On Thu, Apr 17, 2014 at 03:32:25PM -0400, Bill Cox wrote:
> You're right.  I skimmed the code, and I saw:
> 
> /* 6: for i = 0 to N - 1 do */
>                 for (i = 0; i < Nloop; i += 2) {
> ...
> 
> The comment made me think it would execute N-1 times, but that's just the
> scrypt comment.  I went back and read your documentation on Nloop in your
> writeup, and it's clear enough.  I just missed it before.

I can see how the comment can be confusing.  Since yescrypt builds upon
scrypt quite closely, I kept the numbered algorithm lines and comments
from scrypt, and in the writeup I am adding unnumbered algorithm lines
in between those, to show just what is added compared to scrypt.  Perhaps
at some point it'd be cleaner to define yescrypt on its own, rather than
as a "diff from scrypt".

> BTW, I think having a partial second loop is an excellent idea.  Your use
> of t_cost in this way wins.  It's even better than just having an outer
> loop like garlic, without increasing memory.  I see Lyra2 does something
> similar.  In both Lyra2 and yescrypt, TMTO resistance increases with
> t_cost.  Very nice.

I like this too.  Besides TMTO resistance, the average-to-peak memory
usage ratio also increases with higher t_cost.  This wouldn't happen
with Catena's garlic.
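
To illustrate (with a toy model only, not yescrypt's actual Nloop
formula): the filling loop's memory usage grows linearly, so it
contributes only half of peak to the average, whereas the extra
t_cost-driven iterations run at peak:

/* Toy model of the average-to-peak memory ratio.  n_fill is the
 * number of memory-filling iterations, n_extra the t_cost-driven
 * second-loop iterations; yescrypt's real Nloop formula differs. */
#include <stdio.h>

static double avg_to_peak(double n_fill, double n_extra)
{
	/* Filling loop: memory grows linearly, averaging peak/2.
	 * Extra iterations: memory stays at peak. */
	return (0.5 * n_fill + n_extra) / (n_fill + n_extra);
}

int main(void)
{
	printf("no second loop: %.2f\n", avg_to_peak(1.0, 0.0));     /* 0.50 */
	printf("extra = N/3:    %.2f\n", avg_to_peak(1.0, 1.0 / 3)); /* 0.62 */
	printf("extra = N:      %.2f\n", avg_to_peak(1.0, 1.0));     /* 0.75 */
	return 0;
}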

> I tried r=32, and here are the results:

Thanks!  Did you also try r=16?  In some of my tests, r=16 was actually
faster than r=32, although that was with 2 threads/core.  r=32 needs
about 16 KiB of L1 cache per thread: 4 KiB for the current block (X or
V_i), 4 KiB for the random block (V_j), and 8 KiB for the pwxform
S-boxes, so with 2 threads/core we're bumping into the 32 KiB L1 cache
size.
This is not yet the case with 1 thread/core, though.
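
For reference, here's the arithmetic behind that estimate, assuming
scrypt-style 128*r-byte blocks and the 8 KiB of S-boxes mentioned
above:

/* Per-thread L1 working set for a given r, assuming 128*r-byte
 * blocks (scrypt convention) plus 8 KiB of pwxform S-boxes. */
#include <stdio.h>

int main(void)
{
	unsigned r;

	for (r = 8; r <= 32; r *= 2) {
		unsigned block = 128 * r;               /* X or V_i */
		unsigned per_thread = 2 * block + 8192; /* + V_j + S-boxes */
		printf("r=%2u: %2u KiB per thread\n", r, per_thread / 1024);
	}
	return 0;
}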

> Lyra2:
> Peak memory/second: 1.52 GiB/s
> Average memory/second: 1.14 GiB/s

These look correct to me, given the Lyra2 algorithm on page 11 in
Lyra2ReferenceGuide.pdf for T=1.

> Memory bandwidth: 10.64 GiB/s

This assumes a total of 7 reads plus writes per location, which I'm not
sure is correct.  Looking at the algorithm, it might be even more, but
that would push the memory bandwidth usage even higher, which would be
surprisingly high.  It might be 4 r/w per iteration in Filling Loop, and
5 r/w per iteration in Visitation Loop, so 9 total.  If this is correct,
we'd see 13.68 GiB/s here.  This might be realistic because, if I
understand correctly, the same rand is reused with and without rotW(),
which halves the amount of crypto-like processing per memory access even
compared to reduced round TwoCats and yescrypt.  (If one round of
pwxform is crypto-like, whereas one bit-rotate is not.)
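
The bandwidth estimate is just the peak memory per second times the
assumed reads plus writes per location, so it scales directly with
that count:

/* Trivial check of the arithmetic, using the quoted Lyra2 figure of
 * 1.52 GiB/s peak memory per second. */
#include <stdio.h>

int main(void)
{
	double peak = 1.52;  /* GiB/s, peak memory per second */
	printf("7 r/w per location: %.2f GiB/s\n", peak * 7); /* 10.64 */
	printf("9 r/w per location: %.2f GiB/s\n", peak * 9); /* 13.68 */
	return 0;
}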

BTW, yescrypt at zero rounds of pwxform might well be a reasonable thing
to use, if we're to maximize memory bandwidth usage while not even
trying to keep any compute time hardening (besides the occasional Salsa20).
I just haven't studied it enough to recommend it as an option, and I am
not aware of a use case where zero compute time hardening would be good.
When making the rounds count tunable (and encoding it in the flags
field), I am not sure if I should make zero a supported value.
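
For the sake of discussion, a rounds field could be packed into the
flags word along these lines; the shift and width below are made up
for illustration and are not a statement about yescrypt's actual
flags layout:

/* Hypothetical only: packing a pwxform rounds count into a few bits
 * of a flags word.  Field position and width are illustrative and do
 * not reflect yescrypt's actual flags encoding. */
#define PWXROUNDS_SHIFT 8
#define PWXROUNDS_BITS  3
#define PWXROUNDS_MASK  (((1u << PWXROUNDS_BITS) - 1) << PWXROUNDS_SHIFT)

static unsigned flags_set_rounds(unsigned flags, unsigned rounds)
{
	return (flags & ~PWXROUNDS_MASK) |
	    ((rounds << PWXROUNDS_SHIFT) & PWXROUNDS_MASK);
}

static unsigned flags_get_rounds(unsigned flags)
{
	return (flags & PWXROUNDS_MASK) >> PWXROUNDS_SHIFT;
}

With a field like this, zero would be representable either way, and
could simply be rejected at parameter-validation time if it ends up
unsupported.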

> Yescrypt: Salsa20/2, 1 PWXFORM round, r=32
> Peak memory/second: 3.09 GiB/s
> Average memory/second: 1.93 GiB/s
> Memory bandwidth: 8.23 GiB/s
> 
> TwoCats: multiplies=1
> Peak memory/second: 4.64 GiB/s
> Average memory/second: 2.23 GiB/s
> Memory bandwidth: 9.28 GiB/s
> 
> I agree that average memory*time is the most important metric.  TwoCats is
> only ahead by about 15% on that scale.  If I want more compute time
> hardening, I should set multiplies to 2, in which case I'd probably run
> slower.  In any case, my machine changes its mind on runtimes by 15%
> depending on its mood, it seems, so we're getting down to a level of speed
> improvement that even I can't get excited about.
> 
> I'm still sticking to my "alternate win condition" based on peak memory *
> time :-)

Sounds good.
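
For what it's worth, the ~15% figure matches the quoted average
memory/second numbers:

/* Quick check of the ~15% figure, using the average memory/second
 * numbers quoted above. */
#include <stdio.h>

int main(void)
{
	double twocats = 2.23, yescrypt = 1.93;  /* GiB/s, from above */
	printf("TwoCats lead: %.1f%%\n", (twocats / yescrypt - 1.0) * 100.0);
	return 0;
}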

Alexander
