phc-discussions - Low Argon2 performance in L3 cache

Message-ID: <CAOLP8p4G_VbYovrYTjffv_fFv9NRQBLWHLFNWXgsrX6tuy241w@mail.gmail.com>
Date: Thu, 3 Sep 2015 14:53:59 -0700
From: Bill Cox <waywardgeek@...il.com>
To: "discussions@...sword-hashing.net" <discussions@...sword-hashing.net>
Subject: Low Argon2 performance in L3 cache

Imagine you work at a large company like Facebook and want to convince your
data center guys to use Argon2.  They might have a 1 ms time budget for
password hashing, and be unwilling to budge on that.  In this case, you
really want the algorithm to fill memory rapidly.  Worse, you're sharing
the CPU with other services, so multiple threads are costly, again
upsetting the data center guys.

Here's a speed comparison of single-thread hashing of 4 MiB between Argon2,
Yescrypt-2p, and TwoCats on my Xeon E5-1650 CPU running at 3.50 GHz:

Argon2d: 2.6 ms
Yescrypt-2p: 1.8 ms
TwoCats: 0.72 ms
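
To put these timings in perspective, here is a minimal sketch that converts
each one into a memory fill rate and a time ratio relative to TwoCats.  The
helper names are hypothetical, and printing the squared ratio assumes that
an area*time comparison under a fixed time budget scales with the square of
the time ratio; the post itself does not spell that out.

```python
# Single-thread timings for a 4 MiB hash, copied from the list above.
MEM_MIB = 4
timings_ms = {"Argon2d": 2.6, "Yescrypt-2p": 1.8, "TwoCats": 0.72}


def fill_rate_gib_s(mem_mib, time_ms):
    """Memory filled per second, in GiB/s."""
    return (mem_mib / 1024) / (time_ms / 1000)


def time_ratio(name, baseline="TwoCats"):
    """How many times slower `name` is than the baseline."""
    return timings_ms[name] / timings_ms[baseline]


for name, ms in timings_ms.items():
    r = time_ratio(name)
    # The squared column is an assumption about area*time scaling (see above).
    print(f"{name:12s} {fill_rate_gib_s(MEM_MIB, ms):5.2f} GiB/s  "
          f"time ratio {r:4.2f}  squared {r * r:5.1f}")
```

With these numbers, the Argon2d to TwoCats time ratio is about 3.6, and its
square is about 13.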

Assuming the attacker is L3 memory bandwidth bound, he will have a 13X
higher area*time cost if attacking TwoCats instead of Argon2.  I played
with various parameters to try to understand what's slowing down Argon2.
I am also having trouble getting excellent single-thread performance out of
Yescrypt-2p for L3 bound hashing.  My findings break down roughly like this:

- The new Argon2d implementation is not yet well optimized for speed.
  Argon2d from v.1.1 runs in 1.92 ms.
    - Note that v.1.1 has no multiplications, while Yescrypt does.  When I
      run TwoCats without multiplications, it takes 0.55 ms.
- Argon2 is computation bound, even in v.1.1.  When I comment out the
  second half of the BLAKE2_ROUND calls, it runs in 1.5 ms.
- When I also comment out one call to G1 and one call to G2 in
  BLAKE2_ROUND, it takes only 1.26 ms.

I don't seem to be able to speed it up further than this, and I've already
done some scary things to the BLAKE2_ROUND function and Argon2's
compression function.  I think the speed difference may be due to how
Argon2's "state" variables do not fit into CPU registers, while TwoCats'
do.  I found in tuning TwoCats that this is absolutely critical for good
speed.  Having the state variables in L1 cache is fine for external DRAM
hashing, but not fast enough for L3 hashing.

A natural solution is to use more threads.  However, those other cores are
likely running threads of their own:

Argon2 v1.1.1 4 threads:  0.68 ms
TwoCats 4 threads, no multiplies: 0.35 ms
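
As a rough check on how much the extra threads buy, the sketch below
computes speedup and per-thread efficiency.  It assumes the single-thread
baselines are the 1.92 ms (Argon2d v.1.1) and 0.55 ms (TwoCats without
multiplications) figures quoted earlier; the post does not say which
single-thread runs pair with the four-thread numbers, so treat the pairing
and the function names as hypothetical.

```python
THREADS = 4

# (single-thread ms, four-thread ms); the pairing is an assumption.
runs = {
    "Argon2": (1.92, 0.68),
    "TwoCats, no multiplies": (0.55, 0.35),
}


def speedup(single_ms, multi_ms):
    """How many times faster the multi-thread run is."""
    return single_ms / multi_ms


def efficiency(single_ms, multi_ms, threads):
    """Speedup divided by thread count; 1.0 would be perfect scaling."""
    return speedup(single_ms, multi_ms) / threads


for name, (single_ms, multi_ms) in runs.items():
    print(f"{name:24s} speedup {speedup(single_ms, multi_ms):4.2f}x  "
          f"efficiency {efficiency(single_ms, multi_ms, THREADS):4.2f}")
```

Under that pairing, Argon2 gains about 2.8x from four threads and TwoCats
about 1.6x, so neither scales linearly with thread count.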

Wasn't Alexander getting something like 4,000 Yescrypt 4 MiB hashes per
second?  If true, this is very impressive.

What can be done to Argon2 to improve L3 hashing performance?

Bill
