Message-ID: <20150906084631.GB31208@openwall.com>
Date: Sun, 6 Sep 2015 11:46:31 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Low Argon2 performance in L3 cache
On Sat, Sep 05, 2015 at 01:59:37PM -0700, Bill Cox wrote:
> I agree. Yescrypt is a "recommended" solution, so for this 4MiB hash size,
> I think I should see if I can get the speed where it needs to be for me to
> argue easily for its use. As I recall, there were several possible tweaks
> to get it running essentially on par with TwoCat's speed, which
> unfortunately I need to make a simple argument.
>
> Is Yescrypt essentially finished?
For the purpose of benchmarking, almost. For other purposes, not yet -
but it will be really soon.
> What would you recommend I try to get a
> single-thread hash of 4MiB under 1ms?
I never had that as a goal (like I said, 1ms vs. 2ms just does not
matter; it's throughput that matters), but here are yescrypt benchmarks
on i7-4770K at 4 MiB for different PWXrounds:
PWXrounds = 6 (default):
Benchmarking 1 thread ...
477 c/s real, 477 c/s virtual (511 hashes in 1.07 seconds)
Benchmarking 8 threads ...
1533 c/s real, 191 c/s virtual (1533 hashes in 1.00 seconds)
PWXrounds = 2:
Benchmarking 1 thread ...
983 c/s real, 993 c/s virtual (1023 hashes in 1.04 seconds)
Benchmarking 8 threads ...
1705 c/s real, 214 c/s virtual (3069 hashes in 1.80 seconds)
PWXrounds = 1:
Benchmarking 1 thread ...
1346 c/s real, 1355 c/s virtual (2047 hashes in 1.52 seconds)
Benchmarking 8 threads ...
1764 c/s real, 231 c/s virtual (2047 hashes in 1.16 seconds)
So you can get to 1.02ms at 2 rounds, and to 0.74ms at 1 round, when
computing just 1 hash at a time on that CPU.
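
These latencies are just the reciprocals of the single-thread rates. A minimal Python sketch of that conversion, using the "c/s real" figures above (the function name is made up for illustration and is not part of yescrypt):

```python
def latency_ms(hashes_per_second):
    """Time per hash in milliseconds when hashes are computed one at a time."""
    return 1000.0 / hashes_per_second

# Single-thread "c/s real" figures from the runs above
# (no AVX2, 8 KiB S-boxes, 4 MiB, i7-4770K).
for rounds, cps in [(6, 477), (2, 983), (1, 1346)]:
    print(f"PWXrounds = {rounds}: {latency_ms(cps):.2f} ms")
```
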
Also, as you can see, the throughput figures don't differ as much,
increasing by only 15% as you go from 6 rounds to 1 round. This is why
I prefer 6.
The above is without use of AVX2. With AVX2 and 512-bit (rather than
128-bit) pwxform S-box lookups (but still 8 KiB S-boxes):
PWXrounds = 6:
Benchmarking 1 thread ...
500 c/s real, 505 c/s virtual (511 hashes in 1.02 seconds)
Benchmarking 8 threads ...
1663 c/s real, 216 c/s virtual (3577 hashes in 2.15 seconds)
PWXrounds = 2:
Benchmarking 1 thread ...
1071 c/s real, 1077 c/s virtual (2047 hashes in 1.91 seconds)
Benchmarking 8 threads ...
1749 c/s real, 230 c/s virtual (2047 hashes in 1.17 seconds)
PWXrounds = 1:
Benchmarking 1 thread ...
1421 c/s real, 1421 c/s virtual (2047 hashes in 1.44 seconds)
Benchmarking 8 threads ...
1764 c/s real, 236 c/s virtual (2047 hashes in 1.16 seconds)
As you can see, we get under 1ms at 2 rounds then, and the difference in
throughput between 6 rounds and 1 round becomes even smaller (only 6%).
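
The throughput percentages quoted so far can be rechecked from the 8-thread "c/s real" figures. A short sketch in Python (the helper name is hypothetical; the numbers are the ones printed above for 8 KiB S-boxes):

```python
def gain_percent(low, high):
    """How much higher `high` is than `low`, as a percentage of `low`."""
    return (high / low - 1.0) * 100.0

# 8-thread "c/s real" at 6 rounds and at 1 round, 8 KiB S-boxes.
runs = {"without AVX2": (1533, 1764), "with AVX2": (1663, 1764)}
for label, (six_rounds, one_round) in runs.items():
    print(f"{label}: +{gain_percent(six_rounds, one_round):.0f}%")
```
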
It can also be seen that AVX2 doesn't make a big difference, not even
when we've tuned the S-box lookup width. (It makes more of a
difference for throughput at 2 MiB, though: 3400 vs. 4100.) This is a
reason why I am primarily considering AVX2 and the wider S-box lookups
along with targeting L2 (rather than L1) cache with the S-boxes. Here
are the results for 64 KiB S-boxes (up from 8 KiB above):
PWXrounds = 6:
Benchmarking 1 thread ...
331 c/s real, 331 c/s virtual (511 hashes in 1.54 seconds)
Benchmarking 8 threads ...
1502 c/s real, 192 c/s virtual (1533 hashes in 1.02 seconds)
PWXrounds = 2:
Benchmarking 1 thread ...
695 c/s real, 695 c/s virtual (1023 hashes in 1.47 seconds)
Benchmarking 8 threads ...
1677 c/s real, 214 c/s virtual (3069 hashes in 1.83 seconds)
PWXrounds = 1:
Benchmarking 1 thread ...
938 c/s real, 947 c/s virtual (1023 hashes in 1.09 seconds)
Benchmarking 8 threads ...
1743 c/s real, 224 c/s virtual (3069 hashes in 1.76 seconds)
As you can see, it's almost the same throughput, and it is still
possible to get to around 1ms at 1 round, although the latency for 6
rounds is now up from 2ms to 3ms (but we really shouldn't care).
Out of these benchmarks, I like 64 KiB, 6 rounds best: 1502 c/s, whereas
the highest was 1764 c/s (at 8 KiB, 1 round), so only 17.4% higher. And who cares about the
0.7ms vs. 3ms latency for the case when only 1 core is in use? Even if
there's a low latency budget for a given deployment, it's latency
measured under high load (thus, with multiple cores in use) that will
determine whether a given set of parameters meets the budget or not.
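
To illustrate the point about latency under load, here is a rough estimate in Python. It assumes (this is my simplification, not something the benchmark reports) that all 8 hardware threads are busy, each computing one hash at a time, with the total throughput shared evenly between them:

```python
def loaded_latency_ms(total_hashes_per_second, threads):
    """Estimated per-hash latency with `threads` hashes in flight at once."""
    return 1000.0 * threads / total_hashes_per_second

THREADS = 8  # 4 cores x 2 hardware threads, as in the 8-thread runs above

# 8-thread "c/s real" figures from the benchmarks above.
for label, cps in [("64 KiB, 6 rounds", 1502), ("8 KiB, 1 round", 1764)]:
    print(f"{label}: {loaded_latency_ms(cps, THREADS):.1f} ms per hash")
```

Under that assumption the two settings come out much closer to each other than the single-thread 3ms vs. 0.7ms would suggest.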
128 KiB S-boxes:
PWXrounds = 6:
Benchmarking 1 thread ...
288 c/s real, 288 c/s virtual (511 hashes in 1.77 seconds)
Benchmarking 8 threads ...
1299 c/s real, 166 c/s virtual (1533 hashes in 1.18 seconds)
256 KiB S-boxes:
PWXrounds = 6:
Benchmarking 1 thread ...
200 c/s real, 200 c/s virtual (255 hashes in 1.27 seconds)
Benchmarking 8 threads ...
1043 c/s real, 132 c/s virtual (1785 hashes in 1.71 seconds)
At 8 threads, 1299 c/s is 86.5% of the 1502 c/s that we had at 64 KiB.
1043 c/s is 80.3% of 1299, and 69.4% of 1502.
128 KiB means fully using the 256 KiB L2 caches when running 2 instances
per core, like we do here. 256 KiB means we're exceeding the L2 caches.
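
The cache arithmetic can be spelled out as a small Python sketch. It counts only the S-boxes and ignores everything else that competes for L2, so treat it as a rough guide; the names are illustrative:

```python
L2_KIB = 256            # per-core L2 size, as stated above
INSTANCES_PER_CORE = 2  # one instance per hardware thread

def l2_usage(sbox_kib):
    """Fraction of one core's L2 taken by the S-boxes of all instances on it."""
    return sbox_kib * INSTANCES_PER_CORE / L2_KIB

for sbox_kib in (8, 64, 128, 256):
    print(f"{sbox_kib:3d} KiB S-boxes: {l2_usage(sbox_kib):.0%} of L2")
```
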
Alexander