[<prev] [next>] [day] [month] [year] [list]
Message-ID: <541492D2.30407@ciphershed.org>
Date: Sat, 13 Sep 2014 14:54:10 -0400
From: Bill Cox <waywardgeek@...hershed.org>
To: discussions@...sword-hashing.net
Subject: A review per day - Skipping Argon
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
... because I did such a review just before starting this. In fact,
picking so strongly on Argon made me feel I was being unfair. To be
fair, I have to pick on everyone :-)
For reference, I'll paste my previous review below, and after that a
post where I found that Argon is highly parallelizable. I have also
started recommending whether or not I feel an algorithm belongs in the
second round. While Argon is well written and I respect the author
for good work, I personally do not feel Argon needs to be in the
second round, given the strength of competing algorithms.
Old review follows
- ------------------
Argon has been on my list as a well written, well thought out PHC
entry, and the author even invited me a while back to benchmark it,
which I will do a bit here.
First, it was developed with modern SIMD architectures in mind. To
me, this differentiates a lot of the candidates from the others. It
is simply not possible to make a great high performance password
hashing function without understanding the impact of SIMD
instructions. The Argon author not only took SIMD into account, but
also used the AES-NI instructions to good effect. Because of the
attention paid to efficiency, Argon is relatively fast compared to
many submissions. It is possible to use it for 1GiB hashes on modern
PCs without a totally unreasonable delay (closer to 1 second rather
than 1 minute is basically my threshold). One of the best parts of
Argon is it's simplicity. Even the optimized code is easy to read,
and I found no bugs.
When run with t_cost = 1, as I would generally prefer, I would call
this a "hybrid" design. The first loop, SubGroups, does no password
dependent addressing, and is cache-timing attack resistant during this
pass. In the second loop, ShuffleSlices, data derived from the
password is used to pick pseudo-random locations for swapping. This
provides much stronger password defense against off-line brute-force
attacks than a purely cache-timing resistant algorithm, while
providing good defense against brute-force password guessing even if
the attacker has cache-timing data.
Minor nit picks:
- - Calling memset to clear the Input array does not work. Check out
Black2b's secure_zero function
- - Unindent the #ifdef on line 394 of the optimized linux code.
- - To compile on Ubuntu 14.04, I had to change -lpthread to -pthread
The code does a clean job hashing all input parameters. It was a bit
tricky convincing myself that he did not leave any room for collisions
with different input parameters, but I believe it now.
Bigger concerns:
- - Fixed parallelism benefits attackers
The built-in 32-way parallelism can rarely be fully used by users, but
attackers can always use it. This issue can be fixed by adding the
number of threads to the Input array, and using only that number of
"groups", rather than hard-coding it to 32. The thread count *has* to
be an input parameter that effects the output unless we want to give
the attacker a free max parallelism bonus.
- - Lack of initial key derivation forces compute intensive memory
operations, and leaves sensitive data scattered through memory for too
long.
The Input array is properly built. However, the right thing to do at
that point is to create a cryptographic hash summary of it, and zero
it out. This intermediate derived key should then be used to
initialize memory.
As it is, Argon copies sensitive data from the Input array to various
places in state memory, making it far more likely to be written to
disk on a hibernate, or if swapped. The Input array is not cleared
until memory is fully initialized. The clear text Input data in state
memory is not scrambled until the SubGroups operations in the next
memory pass. The entire memory is first zeroed by Linux, then written
in the initialization, then read during SubGroups while doing compute
intensive AES rounds, and finally memory is over-written.
The right place to clear the input data is after deriving the
intermediate key, and before allocating state memory.
Because there is highly sensitive data in memory, Argon *must*
scramble all of memory with cryptographically strong hashing or risk
revealing the password. This excessive hashing is primarily what
makes Argon run slow.
- - Memory access pattern benefits attackers with custom cache memories
An attacker with a custom cache architecture could benefit from the
32x parallelism in ShuffleSlices. Because of the memory interleaved
access pattern, users will see these operations running mostly
sequentially, regardless of the number of threads.
The ShuffleSlices algorithm defines a "slice" as memory locations
separated by 32. There are 32 such slices in the "state". First,
multi-threading performance for is terrible when the threads access
the same memory in an interleaved pattern. Second, these accesses
will be done mostly sequentially, even though there is room for 32X
parallelism. An attacker with a custom cache architecture can add an
extra read/write port to the cache and organize memory into 32 smaller
caches containing only data at multiples of 32 apart. The attacker
with a custom cache will make full use of the 32x parallelism. This
parallelism can be applied to password hashing whether doing a 16MiB
cache-bound hash, or a 1GiB hash of external memory.
- - The memory hashing is too slow
Every even memory location has it's data scrambled with the next
higher location. A clever scheme reduces the latency to 5 AES rounds
rather than 10, likely without sacrificing security. However, what
exactly is the point of securely hashing memory? In Argon's case,
sensitive data was stored there such as the password in plain text, so
memory has to be cryptographically hashed. Had only data created from
the derived key been written to memory, a much faster 1-round of AES
could have been used.
- - Intel specific AES-NI instructions are used
These instructions are awesome, but limit the applicability of Argon.
For large password server based authentication, the server can be
required to support AES-NI. However, for an application such as
full-disk-encryption, there are lots of users who will not have access
to these instructions. As a result, Argon would be far slower than
Script for these users.
Benchmarks
- ----------
I ran this on my 3.4GiH quad-core Ivy Bridge Core i7 processor, with
8GiB CORSAIR Vengence 1600MHz DDR3 memory. I tweaked the number of
threads to minimize delay.
16MiB run:
Linux-AES-NI> time Argon-Optimized -taglength 32 -logmcost 14 -tcost 1
- -pwdlen 64 -saltlen 16 -threads 5
Memory allocated: 16 MBytes, 5 threads
Argon: 5.57 cpb 86.99 Mcycles 0.0614 seconds
real 0m0.028s
user 0m0.059s
sys 0m0.003s
1GiB run:
Linux-AES-NI> time Argon-Optimized -taglength 32 -logmcost 20 -tcost 1
- -pwdlen 64 -saltlen 16 -threads 6
Memory allocated: 1024 MBytes, 6 threads
Argon: 6.58 cpb 6580.35 Mcycles 7.4787 seconds
real 0m2.035s
user 0m7.383s
sys 0m0.097s
This isn't too bad. The 1GiB runtime is a bit slower than Script with
1 thread, though, so I do not consider this an improvement over Script
performance. The 32X built-in parallelism I believe will make Argon
more susceptible to GPU attacks than Script as well, at least for the
SubGroup portion. The pseudo-random 16-byte accesses in ShuffleSlices
should provide some GPU defense, but because they are over the whole
memory, users will also be slowed down considerably by these accesses.
While the code is well written, I do not feel Argon currently provides
better security than Script.
For comparison, here's TwoCat's benchmarks run in the same way:
16MiB run:
twocats> time ./twocats -m 14 -P1
hash:blake2s memCost:14 timeCost:0 multiplies:2 lanes:8 parallelism:1
algorithm:twocats-extended password:password salt:salt blockSize:16384
subBlockSize:64
97 95 79 57 81 2d 77 40
a5 b5 db 98 b7 ea 26 f9
ad fc 7f ea 14 c4 40 8e
66 7f 34 ff 0e 3c 39 69 32 (octets)
real 0m0.007s
user 0m0.005s
sys 0m0.003s
1GiB run:
twocats> time ./twocats -m 20 -P2
hash:blake2s memCost:20 timeCost:0 multiplies:2 lanes:8 parallelism:2
algorithm:twocats-extended password:password salt:salt blockSize:16384
subBlockSize:64
3b f7 f6 60 e2 44 96 9c
4c a9 b2 ed 87 27 0b 58
23 0c 5f 58 35 07 2b 82
24 c8 9f dc f9 d5 c2 b8 32 (octets)
real 0m0.171s
user 0m0.252s
sys 0m0.084s
Every factor of 10 counts
Argon is highly parallelizable
- ------------------------------
The SubGroup call is "embarrassingly parallelizable", meaning for N
blocks of 32 128-bit state values, we can run it in parallel N ways.
Each "SubGroup" hash on groups of 32 128-bit states is independent from
the others.
The ShuffleSlices is parallelizable well beyond the intended 32-way
parallelism. First, run all 32 slices in parallel. Then, for each
slice, compute many sequential j values in parallel. This is not a
problem if we build a custom cache that presents many sequential state
values at once to each slice computation unit. Parallel addition of
all the j values is then a simple matter, Once they are computed, we
can compute the destination addresses for swapping in parallel, and
even do the swapping in parallel with our custom cache. The only
problem is if any destination j address is within the set of states we
added together in parallel, in which case we abort and do this group
in sequentially.
With the right choice for number of j values, aborting should not
happen often enough to reduce the runtime. For an extra 16-way
parallelism on a 16MiB hash (total of 512 parallel cores, since we
have 32 parallel slices), only about 1 in 128 blocks of 32 would have
to be computed sequentially.
Memory bandwidth could be a limiting factor, but a cache architecture
for delivering 32 pseudo-random values in parallel is doable with
reasonable latency.
Not only that, but Argon can be run in constant time and memory with
cache timing data, regardless of m_cost and t_cost settings. An
attacker only needs to hash the first 512 bytes before he can
replicate data dependent reads. This is a far worse parallel attack
with cache timing data than the one the Catena paper outlined against
Script.
I had not realized that the Argon authors claimed that t_cost == 3 is
the minimum safe number of rounds for 1GiB, and for 10MiB, it was 236.
Running the benchmarks again with these numbers:
10MiB case:
Linux-AES-NI> time Argon-Optimized -taglength 32 -logmcost 10 -tcost
234 -pwdlen 64 -saltlen 16 -threads 5
Memory allocated: 1 MBytes, 5 threads
Argon: 346.15 cpb 338.04 Mcycles 0.3923 seconds
real 0m0.106s
user 0m0.347s
sys 0m0.047s
1GiB case:
Linux-AES-NI> time Argon-Optimized -taglength 32 -logmcost 20 -tcost 3
- -pwdlen 64 -saltlen 16 -threads 8
Memory allocated: 1024 MBytes, 8 threads
Argon: 13.25 cpb 13250.55 Mcycles 18.4738 seconds
real 0m4.097s
user 0m18.329s
sys 0m0.146s
I claim my 7ms TwoCats hash of 16MiB is far more secure against brute
force off-line guessing attacks, with or without cache timing data.
Given a full 100ms, I'd hash over 500MiB on this machine, providing
even more security. Yescrypt can do this just as fast, when
appropriate settings are used. Lyra2 does pretty well, though not as
fast.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAEBAgAGBQJUFJLPAAoJEAcQZQdOpZUZ3QMP/jDW3UB7NwrIzihrdLCgrR2T
OfbiX9apGL9HShKVa7gXLsEh2lWpSLmfShMX1H2VkWf0ODHpuUzF6jYpuQMO3zvT
/yPJ2L9+Ahi2x8XyzaVqLFAEkFwQeM3reIQsVZrHUzcJLXXuOfIZHjcrFv9YmU/2
aykj3hmPTbG0dWfY0JH1ijVc5aQzoN65ISwF1Qv1COD0oOtzWtF4Pjyq60iE+Daw
AapAnIxjsc/W7H0qJgrJuW8sh1EzpcIxrw4rSyM8M++rCaQWpZDa1yYj/JXq7i8v
csmus5retllRmBmNonGySOheyNvxOyILyZZlLgdJVdMOJNroKN0a3F4sOgN1qiib
tsXovs8An4Ya1+2MFpFP6HtKYvGAKj2qAxHeVpdll6aZihEWU+ddpKakxb3AzlVl
4vtXCgFNvt8gk6ymY/bHvdR4M3j74reBld7rt20MaXRaPIydLvpyVYAjeZyY/ZmD
vFEwR2bAgw/UmIoUbLswC7BUidJwZgbtY7ocESbA0ZCM36vUQiQVDCmlLOvGnAI1
JygpN1ffXKAO4aanoIaVLBlAYEKmLVg5unxlqIhedg5PjVcn+raHlHB5NU4f4SOS
mav98Q4d+P2CwkJ4mtC0sCiGd4TSVhgPz8v62s62k+PoJ3C8FO3CVAgzymL8SXyV
F6Z33jpVcGyTpyI4D006
=LajT
-----END PGP SIGNATURE-----
Powered by blists - more mailing lists