phc-discussions - Argon is highly parallelizable...

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <53FCCF75.6050705@ciphershed.org>
Date: Tue, 26 Aug 2014 14:18:29 -0400
From: Bill Cox <waywardgeek@...hershed.org>
To: discussions@...sword-hashing.net
Subject: Argon is highly parallelizable...

The SubGroup call is "embarrassingly parallelizable", meaning for N
blocks of 32 128-bit state values, we can run it in parallel N ways.
Each "SubGroup" hash on groups of 32 128-bit states is independent from
the others.

The ShuffleSlices is parallelizable well beyond the intended 32-way
parallelism.  First, run all 32 slices in parallel.  Then, for each
slice, compute many sequential j values in parallel.  This is not a
problem if we build a custom cache that presents many sequential state
values at once to each slice computation unit.  Parallel addition of
all the j values is then a simple matter,  Once they are computed, we
can compute the destination addresses for swapping in parallel, and
even do the swapping in parallel with our custom cache.  The only
problem is if any destination j address is within the set of states we
added together in parallel, in which case we abort and do this group
in sequentially.

With the right choice for number of j values, aborting should not
happen often enough to reduce the runtime.  For an extra 16-way
parallelism on a 16MiB hash (total of 512 parallel cores, since we
have 32 parallel slices), only about 1 in 128 blocks of 32 would have
to be computed sequentially.

Memory bandwidth could be a limiting factor, but a cache architecture
for delivering 32 pseudo-random values in parallel is doable with
reasonable latency.

Not only that, but Argon can be run in constant time and memory with
cache timing data, regardless of m_cost and t_cost settings.  An
attacker only needs to hash the first 512 bytes before he can
replicate data dependent reads.  This is a far worse parallel attack
with cache timing data than the one the Catena paper outlined against
Script.

I had not realized that the Argon authors claimed that t_cost == 3 is
the minimum safe number of rounds for 1GiB, and for 10MiB, it was 236.
 Running the benchmarks again with these numbers:

10MiB case:

Linux-AES-NI> time Argon-Optimized -taglength 32 -logmcost 10 -tcost
234 -pwdlen 64 -saltlen 16 -threads 5
Memory allocated: 1 MBytes, 5 threads
Argon:  346.15 cpb 338.04 Mcycles 0.3923 seconds

real	0m0.106s
user	0m0.347s
sys	0m0.047s

1GiB case:

Linux-AES-NI> time Argon-Optimized -taglength 32 -logmcost 20 -tcost 3
-pwdlen 64 -saltlen 16 -threads 8
Memory allocated: 1024 MBytes, 8 threads
Argon:  13.25 cpb 13250.55 Mcycles 18.4738 seconds

real	0m4.097s
user	0m18.329s
sys	0m0.146s

I claim my 7ms TwoCats hash of 16MiB is far more secure against brute
force off-line guessing attacks, with or without cache timing data.
Given a full 100ms, I'd hash over 500MiB on this machine, providing
even more security.  Yescrypt can do this just as fast, when
appropriate settings are used.  Lyra2 does pretty well, though not as
fast.

Bill