phc-discussions - Re: [PHC] Argon2 improvement thread

Message-ID: <20150726022644.GA2212@openwall.com>
Date: Sun, 26 Jul 2015 04:26:44 +0200
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Argon2 improvement thread

On Tue, Jul 21, 2015 at 12:50:52PM -0700, Bill Cox wrote:
> On Tue, Jul 21, 2015 at 12:45 PM, Solar Designer <solar@...nwall.com> wrote:
> > MAXFORM is the scalar equivalent to (and subset of) pwxform.  It's
> > neither parallel, nor wide, but is otherwise the same.
> >
> > It would co-exist with Argon2's existing SIMD code.
> 
> Nice!  I was getting excellent multiplication-chain hardening with a
> similar approach.  The scalar pipeline has a faster multiplier.

Actually, which multiplier is faster varies by CPU type and mode.
In x86-64 builds, the scalar code may have to use 64x64->64 when it
actually only needs 32x32->64, and this may be an extra clock cycle.

Per my notes, SIMD 32x32->64 is 3 to 5 cycles, and scalar 32x32->64
where available is also 3 to 5 cycles, but the 64x64->64 is 3 to 6
cycles.  That's across what I deem to be currently relevant x86(-64) CPUs.

But yes, on many recent x86-64 CPUs scalar is faster (like 3 or 4
cycles) and SIMD is slower (like 5 cycles).  On some older ones (but
still relevant), it's the other way around.
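
For reference, the scalar operation in question can be written in
portable C roughly as below.  This is only a sketch: the function names
and the additive constant are made up for illustration, and whether a
given compiler emits a true 32x32->64 multiply or a 64x64->64 one for it
depends on the target and build mode, which is assumed rather than
verified here.

```c
#include <stdint.h>

/* 32x32->64: both inputs are 32-bit, the full 64-bit product is kept.
 * Assumption: on x86-64 this usually becomes a 64-bit multiply on
 * zero-extended inputs, which is the 64x64->64 case discussed above. */
static uint64_t mul32x32_64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Hypothetical constant, used only so the chain does not collapse to
 * zero; it is not taken from any proposed design. */
#define CHAIN_ADD 0x9E3779B97F4A7C15ULL

/* A latency-bound chain: every multiply needs the previous result, so
 * the run time is roughly rounds times the multiplier latency. */
static uint64_t mul_chain(uint64_t x, unsigned rounds)
{
    while (rounds--) {
        uint32_t lo = (uint32_t)x;
        uint32_t hi = (uint32_t)(x >> 32);
        x = mul32x32_64(hi, lo) + CHAIN_ADD;
    }
    return x;
}
```

Because each iteration depends on the one before it, timing such a loop
measures multiplier latency rather than throughput.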

> > Switching Argon2 to use pwxform would be too much of a change - not
> > code-wise, but design-wise.  If we were to do that, then it'd be better
> > to go with (simplified) yescrypt or the like instead, which we already
> > have separately (just not as the PHC winner).
> >
> > Alexander
> 
> Got it.  In that case, I'm for the MAXFORM upgrade.  I agree that
> Bcrypt-like GPU resistance is a critical defense.  For example, without it,
> I would have to use Yescrypt in the PoW systems I've been playing with
> rather than Argon2.  With MAXFORM, would Argon2's GPU resistance be as
> strong as Yescrypt's?

The short answer is: not exactly as strong (with differences possible in
either direction), but similar.

Agnieszka's latest benchmarks of yescrypt on GPU, along with the bcrypt
on GPU benchmarks we had before, help me answer this more precisely.

As can be seen from Agnieszka's benchmarks, the current pwxform default
of 128-bit lookups, as opposed to bcrypt's 32-bit ones, appears to have
actually hurt performance on AMD GCN GPUs by roughly 4x compared to
bcrypt (which runs at CPU-like speed on those GPUs).  This is consistent
with my understanding of GCN's local memory ports (lots of 32-bit wide
ports, so when we make wider lookups, we use multiple ports at once).
However, this didn't make much of a difference on NVIDIA Kepler GPUs
(bcrypt runs on those at a similarly poor speed to what we're getting
for yescrypt now).  For NVIDIA Maxwell, I am not ready to answer:
Agnieszka's Maxwell GPU is small, and I don't readily have bcrypt speed
numbers for it.

With MAXFORM, the lookups will be 64-bit, which is in-between.
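
A back-of-the-envelope model of the port usage, as a sketch only.  It
assumes 32-bit wide local memory ports as described above and nothing
else limiting throughput, which is a simplification and not a
measurement.  The names PORT_BITS and ports_per_lookup are hypothetical.

```c
/* Assumption: each local memory port is 32 bits wide. */
#define PORT_BITS 32u

/* Number of ports one lookup of the given width occupies at once,
 * rounded up to a whole port. */
static unsigned ports_per_lookup(unsigned lookup_bits)
{
    return (lookup_bits + PORT_BITS - 1) / PORT_BITS;
}
```

Under this model bcrypt's 32-bit lookups occupy 1 port each, MAXFORM's
64-bit lookups 2, and pwxform's default 128-bit lookups 4.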

At the same time, the parallelism will be lower (in yescrypt, we
currently have 4 parallel 128-bit lanes, resulting in a total of 4x2 = 8
parallel S-box lookups, whereas in MAXFORM it's just one chain with 2
parallel S-box lookups).  The effect of this will probably vary across
GPU types and attack implementations.  On one hand, this may allow us to
increase the latency hardening compared to yescrypt's (use more MAXFORM
rounds than yescrypt would use pwxform rounds), although it looks like
we're not going to (on the contrary, we may prefer to minimize the
overhead of having the MAXFORM chain added).  On the other hand, this means
less competition for access ports to whatever memory stores the S-boxes
in an attacker's device such as a GPU.
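
To make "one chain with 2 parallel S-box lookups" concrete, here is a
minimal sketch of what a single 64-bit lane could look like.  It assumes
MAXFORM keeps pwxform's per-lane step (multiply the two 32-bit halves,
add one S-box entry, XOR another), which this thread suggests but this
message does not spell out.  The S-box size, the word-based indexing and
all names below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical S-box size in 64-bit words; must be a power of two. */
#define SBOX_WORDS 512u
#define SBOX_MASK  (SBOX_WORDS - 1u)

typedef struct {
    uint64_t s0[SBOX_WORDS];
    uint64_t s1[SBOX_WORDS];
} sboxes_t;

/* One round on a single 64-bit lane.  The two lookups depend only on
 * the incoming x, so they can proceed in parallel with each other and
 * with the multiply; the next round cannot start until all three are
 * done. */
static uint64_t maxform_like_round(const sboxes_t *s, uint64_t x)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint64_t p0 = s->s0[lo & SBOX_MASK];  /* 64-bit lookup #1 */
    uint64_t p1 = s->s1[hi & SBOX_MASK];  /* 64-bit lookup #2 */

    x = (uint64_t)hi * lo;                /* 32x32->64 multiply */
    x += p0;
    x ^= p1;
    return x;
}

/* The chain: more rounds means more sequential latency per block. */
static uint64_t maxform_like_chain(const sboxes_t *s, uint64_t x,
                                   unsigned rounds)
{
    while (rounds--)
        x = maxform_like_round(s, x);
    return x;
}
```

By comparison, yescrypt's current settings would run four 128-bit lanes
of this kind side by side, giving the 8 parallel lookups mentioned above.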

The lower parallelism will definitely help harden defensive
implementations on older and smaller CPUs, as well as scalar builds
(where the Blake2b is also using scalar code) even when running on
recent SIMD-capable CPUs.  This is actually a concern I have about
yescrypt's current pwxform settings: they're worse than bcrypt when
faced with those SIMD-less CPUs/builds.  MAXFORM (or obviously tuning
yescrypt's pwxform like it) addresses this.  Argon2 with MAXFORM will
have this concern addressed all the time...  OK, that's not exactly
true: setting the MAXFORM rounds count low in order to expose Argon2's
parallelism has roughly the same effect as setting yescrypt's PWXsimple
and PWXgather high in order to have the parallelism.  So it's similar.

Alexander