phc-discussions - yescrypt throughput vs. PWXrounds


Message-ID: <20150403180559.GA27352@openwall.com>
Date: Fri, 3 Apr 2015 21:05:59 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: yescrypt throughput vs. PWXrounds

Bill, all -

FWIW, here's what I am getting on FX-8120 with 2x DDR3-1600 (I should
probably re-do this on more machines).  The first column is the number of
rounds of pwxform (the current default is 6), followed by throughput in
hashes/second for 8 threads / 1 thread for 2 MB, 128 MB, and 2 MB RAM +
2 GB ROM.  For the multi-thread throughput figures, the threads are
independent (simulating an authentication server) and the total amount
of RAM is what's shown times the number of threads (so 1 GB for the
8-thread tests in the 128 MB column).

rounds  2 MB            128 MB          2 MB + 2 GB ROM
6       2772 / 511      30 / 7          2592 / 486
4       3653 / 691      32 / 9          3269 / 647
2       5340 / 1077     33 / 13         4288 / 974
1       6454 / 1451     33 / 15         4760 / 1255

As you can see, when using only RAM and being out of cache and running
as many threads as the hardware supports (8 on this CPU), there's only
a 10% speedup possible from reducing PWXrounds from 6 to 1.  OTOH, when
the machine is under-loaded, running only 1 thread, there's a 2x+
speedup possible (7 to 15 hashes/second in 1 thread).  I optimized for
best behavior when server capacity is reached (because that's what
limits the cost settings), as well as for multi-threaded KDF use.  For
this, the choice of 6 rounds still looks good to me.  BTW, looking at
these numbers another way, it's 3 GB of memory filled (and 8 GB of
bandwidth used) in 1 second, despite the high PWXrounds setting.
This can be improved to 3.3 GB (and 9 GB bandwidth usage).  Worth it?
I'd rather opt for the 10% lower memory and bandwidth usage figure, but
gain diversity of defense (3x or 6x higher compute hardening).
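
For anyone who wants to recompute the speedups quoted above, here is a
minimal sketch in Python.  It only re-does the arithmetic on the table;
the names THROUGHPUT and speedup() are made up for this illustration and
are not part of yescrypt.

```python
# Throughput in hashes/second copied from the table above,
# stored as (8 threads, 1 thread) per PWXrounds value.
THROUGHPUT = {
    "2 MB": {6: (2772, 511), 4: (3653, 691), 2: (5340, 1077), 1: (6454, 1451)},
    "128 MB": {6: (30, 7), 4: (32, 9), 2: (33, 13), 1: (33, 15)},
    "2 MB + 2 GB ROM": {6: (2592, 486), 4: (3269, 647), 2: (4288, 974), 1: (4760, 1255)},
}


def speedup(column, rounds_from, rounds_to, threads_index):
    """Throughput at rounds_to divided by throughput at rounds_from.

    threads_index is 0 for the 8-thread figure and 1 for the 1-thread figure.
    """
    before = THROUGHPUT[column][rounds_from][threads_index]
    after = THROUGHPUT[column][rounds_to][threads_index]
    return after / before


if __name__ == "__main__":
    for column in THROUGHPUT:
        s8 = speedup(column, 6, 1, 0)
        s1 = speedup(column, 6, 1, 1)
        print(f"{column:16s} 8 threads: {s8:.2f}x   1 thread: {s1:.2f}x")
```

For the 128 MB column this gives 33/30 = 1.10 with 8 threads and
15/7 = 2.14 with 1 thread, which is where the "10%" and "2x+" above come
from.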

When much of the RAM portion fits in a cache, there's significant
speedup from lower PWXrounds, even when running 8 threads.  However, the
speedup is not enough to keep the compute hardening per time the same.
For example, 2772*6 / (3653*4) = 1.14, but 6/4 = 1.5, and
2772*6 / (5340*2) = 1.56, but 6/2 = 3.  So going for PWXrounds = 2 would
halve the compute hardening per time.  Maybe that's OK, but I wouldn't
be able to claim that yescrypt achieves bcrypt-like frequency(*) of its
S-box lookups and thus is at least as GPU-unfriendly as bcrypt even at
the lowest m_cost settings.  Would being no more than 2x worse than
bcrypt still be OK?  I'm not sure.  I would be uncomfortable about that,
even though bcrypt isn't one of the PHC finalists. ;-)

(*) Also considered are parallelism of the S-box lookups and total size
of the S-boxes.
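
The compute-hardening-per-time comparison above can be sketched the same
way.  This treats throughput times PWXrounds as a proxy for pwxform
rounds performed per second, which is the same simplification as in the
ratios quoted above; the helper name hardening_ratio() is made up for
this illustration.

```python
# 8-thread throughput (hashes/second) for the 2 MB column of the table.
THROUGHPUT_2MB_8T = {6: 2772, 4: 3653, 2: 5340, 1: 6454}


def hardening_ratio(rounds, baseline=6):
    """How many times more round-work per second the baseline setting does.

    Round-work per second is approximated as throughput * PWXrounds.
    """
    baseline_work = THROUGHPUT_2MB_8T[baseline] * baseline
    reduced_work = THROUGHPUT_2MB_8T[rounds] * rounds
    return baseline_work / reduced_work


if __name__ == "__main__":
    for rounds in (4, 2, 1):
        needed = 6 / rounds  # speedup needed to keep round-work per second equal
        actual = THROUGHPUT_2MB_8T[rounds] / THROUGHPUT_2MB_8T[6]
        print(f"PWXrounds={rounds}: needed {needed:.2f}x, got {actual:.2f}x, "
              f"hardening per time lower by {hardening_ratio(rounds):.2f}x")
```

hardening_ratio(4) and hardening_ratio(2) reproduce the 1.14 and 1.56
figures given above.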

Should we have PWXrounds (auto-)tuned differently for the
single-threaded case?  With password hashing use, yescrypt being invoked
with p=1 doesn't mean there isn't another instance running concurrently.
In fact, in terms of capacity planning we should assume that there are
as many such instances as the hardware supports.  Should we have some
kind of heuristic (or a flag?) to determine KDF use (e.g., size of
512 MB or more?), and if p=1 then reduce PWXrounds?  This feels like too
much complexity and unexpected behavior, and yescrypt is too complex as
it is.
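
To make the heuristic in question concrete, here is a hypothetical
sketch.  Nothing like this exists in yescrypt; the function name, the
exact threshold constant, and the reduced_rounds parameter are
assumptions for illustration only.  The reduced value is deliberately
left to the caller, since no specific value is proposed above.

```python
# "512 MB or more" is the example threshold mentioned above.
KDF_SIZE_THRESHOLD = 512 * 1024 * 1024


def choose_pwxrounds(mem_bytes, p, reduced_rounds, default_rounds=6):
    """Hypothetical heuristic: guess KDF use from memory size and p.

    Returns reduced_rounds only when the request looks like a large,
    single-threaded KDF invocation; otherwise keeps the default.
    """
    looks_like_kdf = mem_bytes >= KDF_SIZE_THRESHOLD
    if looks_like_kdf and p == 1:
        return reduced_rounds
    return default_rounds


if __name__ == "__main__":
    # Password hashing at 2 MB keeps the default even with p=1.
    print(choose_pwxrounds(2 * 1024 * 1024, p=1, reduced_rounds=2))
    # A 1 GB single-threaded invocation gets the reduced value.
    print(choose_pwxrounds(1024 * 1024 * 1024, p=1, reduced_rounds=2))
```

Even in this small form, the result depends on a guess about how the
caller is using the function, which is the unexpected behavior mentioned
above.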

While I don't mind auto-tuning of PWXgather and PWXsimple for the
current machine (and getting them encoded along with the hashes or e.g.
with the encrypted filesystem), auto-tuning of PWXrounds is different
(the best value would vary with other yescrypt parameters and expected
system load, rather than only with the underlying CPU).

Alexander