phc-discussions - Re: [PHC] GPU benchmarks: Lyra2+yescrypt (was: Another PHC candidates "mechanical" tests (ROUND2))

Message-ID: <20150507173247.GA14378@openwall.com>
Date: Thu, 7 May 2015 20:32:47 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] GPU benchmarks: Lyra2+yescrypt (was: Another PHC candidates "mechanical" tests (ROUND2))

Hi Marcos,

Thank you for working on this!  It's much appreciated.

I intend to take a closer look and provide a response to the specific
points you raised later.  Meanwhile:

I find the testing methodology weird and wrong.  You're reporting
microseconds.  You should be reporting hashes per second rates for full
device load, for both CPU and GPU.

At these low m_cost settings, typically neither the defender nor the
attacker cares if the total computation latency is, say, 0.5 ms or 2 ms.
It's still low enough.  What they care about is throughput.

In a similar spirit, I think we should lock p=1 for these tests, and add
the maximum amount of parallelism externally - much like it'd happen on
an authentication server or in an attack.  So e.g. on a quad-core Intel
CPU with HT, we should run 8 threads externally to both yescrypt and
Lyra2, both at p=1.  And keep the memory (de)allocation out of the loop,
if we can - or maybe report both kinds of benchmarks (with this overhead
included or excluded, as it can be either depending on how well a given
software integration or server deployment has been performed).
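
To illustrate the kind of harness meant above, here is a minimal sketch in Python.  It is an assumption-laden illustration, not code from either submission: measure_throughput and stand_in_hash are hypothetical names, and the PBKDF2 call is only a placeholder workload where a p=1 call to yescrypt or Lyra2 would go.  Python threads scale here only if the hash call releases the GIL, so a C harness with native threads would likely be the more faithful choice.

```python
import hashlib
import threading
import time


def stand_in_hash(_thread_id: int) -> bytes:
    # Placeholder workload, NOT yescrypt or Lyra2.  Replace with a p=1
    # call to the function under test, reusing per-thread memory that
    # was allocated before the timed loop starts.
    return hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 1000)


def measure_throughput(hash_fn, n_threads: int, duration_s: float) -> float:
    """Return hashes per second with n_threads workers kept busy."""
    counts = [0] * n_threads
    # Workers plus the main thread meet here so timing starts together.
    barrier = threading.Barrier(n_threads + 1)
    deadline = [0.0]

    def worker(idx: int) -> None:
        barrier.wait()
        while time.perf_counter() < deadline[0]:
            hash_fn(idx)
            counts[idx] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()

    t0 = time.perf_counter()
    deadline[0] = t0 + duration_s
    barrier.wait()  # release all workers at once
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - t0

    # Each worker finishes the hash it started, so divide by real elapsed time.
    return sum(counts) / elapsed


if __name__ == "__main__":
    # E.g. 8 external threads on a quad-core CPU with HT.
    print(f"{measure_throughput(stand_in_hash, 8, 2.0):.0f} hashes/s")
```

The number to report is the aggregate hashes/s at full device load, not the time taken by any single call.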

yescrypt defaults are tuned to provide good defense per defensive
throughput (such as request rate to an authentication server).  It is
quite pointless to test it at less than the maximum number of hardware
threads on CPU, especially at these low m_cost settings.  Also, p > 1 is
not for this use case (at least not on general-purpose CPUs yet; this
may change when CPUs start supporting so many hardware threads that
defensive latency at low load needs to be reduced).  p > 1 is for KDF use.

If Lyra2 defaults are tuned differently, and I think they are, that's a
drawback for this use case.  While we can reasonably imagine that some
full disk encryption app or such would happen to run fewer threads than
the hardware supports, this is not the case for authentication servers.
For those, request rate capacity (and thus the maximum cost settings
that we may set) is determined by what happens when they are fully
loaded, or even overloaded.

For yescrypt t=0, 2 MB, you report latency from 1750 us to 2700 us for
p=1 to p=4.  However, I know that e.g. on i7-4770K, it performs 3400
hashes/s (without use of AVX2) at 2 MB per thread.  Luckily, for the
purpose of this comparison, it's around 3400 for 4 or 8 threads -
doesn't matter.  (I think this is in part due to 2 MB being exactly the
L3 cache size per core, as well as due to yescrypt including slightly
excessive parallelism when run with 128-bit SIMD.)  That's for 8 MB or
16 MB total.  1750 us could suggest a throughput of only 570 hashes per
second, which is about 6 times lower than actual (and IIUC yours is for 2 MB total,
vs. 8 MB or 16 MB total for my 3400 figure).  But things don't really
work like that, which highlights the problem with the methodology.
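
The arithmetic behind that comparison, as a small Python sketch (naive_throughput is a hypothetical helper name; the two input figures are the ones quoted above):

```python
def naive_throughput(latency_us: float) -> float:
    """Hashes per second if one hash at a time ran back to back."""
    return 1_000_000 / latency_us


reported_latency_us = 1750      # reported latency for yescrypt t=0, 2 MB, p=1
measured_hashes_per_s = 3400    # i7-4770K at full load, 2 MB per thread

naive = naive_throughput(reported_latency_us)
ratio = measured_hashes_per_s / naive

print(f"naive estimate: {naive:.0f} hashes/s")
print(f"measured / naive: {ratio:.2f}")
```

The single-call latency says nothing about how many such calls the device sustains concurrently, which is why the naive estimate falls so far short.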

I'd appreciate it if you re-do these benchmarks using proper metrics.

Thanks again,

Alexander