phc-discussions - Re: [PHC] enhancing Argon2 (was: Competition process)


Date: Mon, 20 Apr 2015 21:41:52 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] enhancing Argon2 (was: Competition process)

Dmitry, all -

On Mon, Apr 20, 2015 at 06:16:49AM +0300, Solar Designer wrote:
> On Sun, Apr 19, 2015 at 08:40:06AM +0300, Solar Designer wrote:
> > I now think it'd be best to use the same approach I had suggested and
> > Bill implemented for TwoCats.  Since you're already fully loading the
> > SIMD units with BLAKE2b rounds, use the scalar units for a single
> > pwxform lane chain.  This wouldn't really be pwxform - it would be
> > neither parallel, nor wide since it'd be locked to just one lane.  So no
> > tunable parallelism there.  But other than that, it'd be the same thing,
> > and by tuning the total number of rounds for this un-pwxform that you
> > perform per your 1 KB block, you'd achieve the equivalent of the desired
> > tunable latency and parallelism limitation.  All with just one parameter.
> > And no need to introduce data dependencies between your BLAKE2b rounds,
> > then.  So this replaces my two-bit tunable parallelism idea.
> 
> I've tried implementing and benchmarking this - please see the attached
> patch.  For latency hardening equivalent to yescrypt current defaults -
> so with 6 un-pwxform rounds per each of Argon2's BLAKE2b rounds (of
> which it has 16 per 1 KB) - the performance impact on FX-8120 (whatever
> I happened to test on) at 1 GB is only 15% for 1 thread, 5% for 8
> threads.  In fact, even at double the latency hardening (12 un-pwxform
> rounds per BLAKE2b round), the performance impact is 40% for 1 thread,
> but only 10% for 8 threads (relative to original Argon2).  This is for
> benchmarks inclusive of the memory allocation overhead, though.  The
> impact would be relatively higher with overhead excluded.

I inadvertently made the total S-box size (for the two of them combined)
4 KB in the above benchmarks.  I intended it to be 8 KB, as with
yescrypt's current defaults.  With this corrected (patch attached), I get
a 23% performance impact for 1 thread, and 6% for 8 threads.  That's
still relative to unmodified Argon2, at 6 un-pwxform rounds, 1 GB.
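
For illustration, a minimal sketch of what one such single-lane chain
could look like in scalar C.  This is not taken from the attached patch:
the names unpwxform_round and unpwxform_chain, the layout of each S-box
as 512 64-bit words (4 KB each, 8 KB total), and the index masking are
all assumptions modeled on yescrypt's pwxform round (32x32->64 multiply,
add an S0 entry, XOR an S1 entry).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: two S-boxes of 4 KB each (8 KB total), 64-bit entries. */
#define SBOX_BYTES 4096
#define SBOX_WORDS (SBOX_BYTES / sizeof(uint64_t))
#define SBOX_MASK  (SBOX_WORDS - 1)

/* One round of a single 64-bit lane: multiply the two 32-bit halves,
 * then mix in one entry from each S-box, indexed by the old halves. */
static uint64_t unpwxform_round(uint64_t x,
                                const uint64_t *S0, const uint64_t *S1)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);

    x = (uint64_t)hi * lo;      /* 32x32->64 multiply (latency-bound) */
    x += S0[lo & SBOX_MASK];    /* S-box lookup, combined by addition */
    x ^= S1[hi & SBOX_MASK];    /* S-box lookup, combined by XOR */
    return x;
}

/* A chain of 'rounds' sequentially dependent rounds.  Each round needs
 * the previous round's output, so the chain cannot be parallelized. */
static uint64_t unpwxform_chain(uint64_t x,
                                const uint64_t *S0, const uint64_t *S1,
                                unsigned rounds)
{
    while (rounds--)
        x = unpwxform_round(x, S0, S1);
    return x;
}
```

In this sketch the rounds parameter is the single tunable: with 6 rounds
per BLAKE2b round and 16 BLAKE2b rounds per 1 KB block, the chain would
run 96 dependent rounds per block.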

A concern is that when the defensive running time is limited by this
scalar chain, we're making Argon2 more susceptible to CPU attacks, where
the attacker would interleave 2+ instances (and more RAM is typically
available in the system anyway).  This is partially mitigated by us
being close to bumping into L1 data cache size, but nevertheless it is a
concern.  For this reason, maybe a smaller default un-pwxform rounds
count (such as 3 or 4) should be chosen, especially at low (defensive)
thread counts.

> I think these numbers are very good.  I recommend that this change be
> completed (need to initialize the S-boxes properly, and decide on which
> state[] elements and how are passed through this chain) and merged in
> (also need to implement it into the non-SIMD version).

I still think so.

I now put the high 32 bits of the final output of the multiplication
chain into state[0]'s low 32 bits, which I think are then used for phi:

		phi = _mm_extract_epi32(prev_block[0], 0);

BTW, since only element 0 is extracted here, this SSE4.1 intrinsic
(found in original argon2d-opt-sse.cpp) may be replaced with SSE2's
_mm_cvtsi128_si32().
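
A minimal SSE2-only sketch of both steps, under stated assumptions: the
helper names inject_chain_high32 and extract_phi are hypothetical, and
overwriting the low 32 bits (rather than, say, XORing into them) is a
guess at how the chain output might be merged, not a description of the
attached patch.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Put the high 32 bits of the 64-bit chain output into the low 32 bits
 * of a 128-bit state word, leaving the other 96 bits untouched.
 * (Overwriting is an assumption made for this sketch.) */
static __m128i inject_chain_high32(__m128i state0, uint64_t chain)
{
    __m128i low_mask = _mm_cvtsi32_si128(-1);  /* ones in bits 0..31 only */
    __m128i hi = _mm_cvtsi32_si128((int)(uint32_t)(chain >> 32));

    /* _mm_andnot_si128(a, b) computes (~a) & b: clear the low 32 bits. */
    return _mm_or_si128(_mm_andnot_si128(low_mask, state0), hi);
}

/* SSE2 equivalent of _mm_extract_epi32(v, 0): read the low 32 bits. */
static uint32_t extract_phi(__m128i v)
{
    return (uint32_t)_mm_cvtsi128_si32(v);
}
```

_mm_cvtsi128_si32() only reads element 0, so it is not a general
substitute for _mm_extract_epi32() with a nonzero index.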

Since we have 8 parallel BLAKE2b's, it may be elegant to split our
64-bit chain as 8 one-byte inputs/outputs into each BLAKE2b (perhaps in
addition to using 32-bit values in state[0] and state[63]).  We could
also merge them in-between the two groups of 8 BLAKE2b's.  Would this
help increase the latency of tradeoff attacks?  I haven't thought about
this much yet, and my gut feeling is that it won't.  It would add some
state that is normally not stored and that tradeoff attacks would need
to store (or else incur a latency increase), but very little of it, so
it's probably not worth the overhead.  Still, I thought I'd bring this
up for discussion.
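
To make the byte-splitting idea concrete, a small sketch: the 64-bit
chain value is cut into 8 one-byte pieces, one per BLAKE2b instance, and
later reassembled.  The function names and the little-endian byte order
are assumptions for illustration only; how each byte would actually be
mixed into a BLAKE2b state is left open here, as it is above.

```c
#include <stdint.h>

/* Split the 64-bit chain value into 8 bytes, least significant first;
 * byte i would be handed to the i-th parallel BLAKE2b. */
static void split_chain(uint64_t x, uint8_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(x >> (8 * i));
}

/* Reassemble a 64-bit chain value from one byte per BLAKE2b output. */
static uint64_t merge_chain(const uint8_t in[8])
{
    uint64_t x = 0;
    for (int i = 0; i < 8; i++)
        x |= (uint64_t)in[i] << (8 * i);
    return x;
}
```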

To increase the latency of tradeoff attacks, I think BlaMka may be used
(along with an un-pwxform chain like this, which serves its different
purpose - hardening non-tradeoff latency and providing some anti-GPU).
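
For reference, a sketch of the BlaMka idea as proposed by the Lyra2
team: each 64-bit addition in BLAKE2b's G function is replaced with an
addition plus twice the product of the operands' low 32-bit halves.  The
names below are hypothetical, and the rotation amounts are BLAKE2b's
usual 32, 24, 16, 63; message-word inputs are omitted, which is an
assumption matching a permutation-only use.

```c
#include <stdint.h>

/* BlaMka replacement for a + b: adds 2 * lo32(a) * lo32(b), which puts
 * a 32x32->64 multiplication on the critical path of every addition. */
static uint64_t blamka(uint64_t a, uint64_t b)
{
    uint64_t lo_a = a & 0xFFFFFFFFULL;
    uint64_t lo_b = b & 0xFFFFFFFFULL;
    return a + b + 2 * lo_a * lo_b;   /* arithmetic is modulo 2^64 */
}

static uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));   /* n is never 0 here */
}

/* BLAKE2b-style G with additions swapped for blamka(). */
static void g_blamka(uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d)
{
    *a = blamka(*a, *b); *d = rotr64(*d ^ *a, 32);
    *c = blamka(*c, *d); *b = rotr64(*b ^ *c, 24);
    *a = blamka(*a, *b); *d = rotr64(*d ^ *a, 16);
    *c = blamka(*c, *d); *b = rotr64(*b ^ *c, 63);
}
```

The multiplications here sit inside the BLAKE2b rounds themselves, which
is why this addresses tradeoff-attack latency, whereas the separate
scalar chain addresses the latency of a straightforward computation.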

Alexander

View attachment "Argon2-latency-hardening2.diff" of type "text/plain" (6592 bytes)