lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 28 Jul 2015 07:19:49 +0000
From: Jean-Philippe Aumasson <>
To: "" <>
Subject: Re: [PHC] Argon2 improvement thread


Updated and refactored the document at

@Dmitry: feel free to comment within the document.

On Mon, Jul 27, 2015 at 11:02 PM Dmitry Khovratovich <>

> On Jul 27, 2015, at 17:39, Jean-Philippe Aumasson <
>> wrote:
> Dmitry, Alex: what about writing a MAXFORM patch for Argon2 and measure
> things? or can we decide without?
> I will write one shortly within a few days.
> Dmitry: ok for a single API with different flags for 2d and 2i? (if so,
> we'll update
> )
> Yes, Ok with that.
> On Sun, Jul 26, 2015 at 4:27 AM Solar Designer <> wrote:
>> On Tue, Jul 21, 2015 at 12:50:52PM -0700, Bill Cox wrote:
>> > On Tue, Jul 21, 2015 at 12:45 PM, Solar Designer <>
>> wrote:
>> > > MAXFORM is the scalar equivalent to (and subset of) pwxform.  It's
>> > > neither parallel, nor wide, but is otherwise the same.
>> > >
>> > > It would co-exist with Argon2's existing SIMD code.
>> >
>> > Nice!  I was getting excellent multiplication-chain hardening with a
>> > similar approach.  The scalar pipeline has a faster multiplier.
>> Actually, which multiplier is faster varies by CPU type and mode.
>> In x86-64 builds, the scalar code may have to use 64x64->64 when it
>> actually only needs 32x32->64, and this may be an extra clock cycle.
>> Per my notes, SIMD 32x32->64 is 3 to 5 cycles, and scalar 32x32->64
>> where available is also 3 to 5 cycles, but the 64x64->64 is 3 to 6
>> cycles.  That's across what I deemed are currently relevant x86(-64) CPUs.
>> But yes, on many recent x86-64 CPUs scalar is faster (like 3 or 4
>> cycles) and SIMD is slower (like 5 cycles).  On some older ones (but
>> still relevant), it's the other way around.
>> > > Switching Argon2 to use pwxform would be too much of a change - not
>> > > code-wise, but design-wise.  If we were to do that, then it'd be
>> better
>> > > to go with (simplified) yescrypt or the like instead, which we already
>> > > have separately (just not as the PHC winner).
>> > >
>> > > Alexander
>> >
>> > Got it.  In that case, I'm for the MAXFORM upgrade.  I agree that
>> > Bcrypt-like GPU resistance is a critical defense.  For example, without
>> it,
>> > I would have to use Yescrypt in the PoW systems I've been playing with
>> > rather than Argon2.  With MAXFORM, would Argon2's GPU resistance be as
>> > strong as Yescrypt's?
>> The short answer is: not exactly as strong (with differences possible in
>> either direction), but similar.
>> Agnieszka's latest benchmarks of yescrypt on GPU, along with the bcrypt
>> on GPU benchmarks we had before, help me answer this more precisely.
>> As can be seen from Agnieszka's benchmarks, the current pwxform default
>> of 128-bit lookups vs. bcrypt's 32-bit ones appears to have actually hurt
>> performance 4x-ish on AMD GCN GPUs, compared to bcrypt's (which runs at
>> CPU-like speed on those GPUs).  This is consistent with my understanding
>> of GCN's local memory ports (lots of 32-bit wide ports, so when we make
>> wider lookups, we use multiple ports at once).  However, this didn't
>> make much of a difference on NVIDIA Kepler GPUs (bcrypt runs at similar
>> poor speed on those that we're getting for yescrypt now).  For NVIDIA
>> Maxwell, I am not ready to answer - Agnieszka's Maxwell GPU is small,
>> and I don't readily have bcrypt speed numbers for it.
>> With MAXFORM, the lookups will be 64-bit, which is in-between.
>> At the same time, the parallelism will be lower (in yescrypt, we
>> currently have 4 parallel 128-bit lanes, resulting in a total of 4x2 = 8
>> parallel S-box lookups, whereas in MAXFORM it's just one chain with 2
>> parallel S-box lookups).  The effect of this will probably vary across
>> GPU types and attack implementations.  On one hand, this may allow us to
>> increase the latency hardening compared to yescrypt's (use more MAXFORM
>> rounds than yescrypt would use pwxform rounds), although it looks like
>> we're not going to (on the contrary, we may prefer to minimize the
>> overhead of having MAXFORM chain added).  On the other hand, this means
>> less competition for access ports to whatever memory stores the S-boxes
>> in an attacker's device such as a GPU.
>> The lower parallelism will definitely help harden defensive
>> implementations on older and smaller CPUs, as well as scalar builds
>> (where the Blake2b is also using scalar code) even when running on
>> recent SIMD-capable CPUs.  This is actually a concern I have about
>> yescrypt's current pwxform settings: they're worse than bcrypt when
>> faced with those SIMD-less CPUs/builds.  MAXFORM (or obviously tuning
>> yescrypt's pwxform like it) addresses this.  Argon2 with MAXFORM will
>> have this concern addressed all the time...  OK, that's not exactly
>> true: setting the MAXFORM rounds count low in order to expose Argon2's
>> parallelism has roughly the same effect as setting yescrypt's PWXsimple
>> and PWXgather high in order to have the parallelism.  So it's similar.
>> Alexander

Content of type "text/html" skipped

Powered by blists - more mailing lists