Message-ID: <20150906084631.GB31208@openwall.com>
Date: Sun, 6 Sep 2015 11:46:31 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Low Argon2 performance in L3 cache

On Sat, Sep 05, 2015 at 01:59:37PM -0700, Bill Cox wrote:
> I agree.  Yescrypt is a "recommended" solution, so for this 4MiB hash
> size, I think I should see if I can get the speed where it needs to be
> for me to argue easily for its use.  As I recall, there were several
> possible tweaks to get it running essentially on par with TwoCats'
> speed, which unfortunately I need in order to make a simple argument.
>
> Is Yescrypt essentially finished?

For the purpose of benchmarking, almost.  For other purposes, not yet -
but it will be really soon.

> What would you recommend I try to get a single-thread hash of 4MiB
> under 1ms?

I never had that as a goal (like I said, 1ms vs. 2ms just does not
matter; it's throughput that matters), but here are yescrypt benchmarks
on i7-4770K at 4 MiB for different PWXrounds:

PWXrounds = 6 (default):

Benchmarking 1 thread ...
477 c/s real, 477 c/s virtual (511 hashes in 1.07 seconds)
Benchmarking 8 threads ...
1533 c/s real, 191 c/s virtual (1533 hashes in 1.00 seconds)

PWXrounds = 2:

Benchmarking 1 thread ...
983 c/s real, 993 c/s virtual (1023 hashes in 1.04 seconds)
Benchmarking 8 threads ...
1705 c/s real, 214 c/s virtual (3069 hashes in 1.80 seconds)

PWXrounds = 1:

Benchmarking 1 thread ...
1346 c/s real, 1355 c/s virtual (2047 hashes in 1.52 seconds)
Benchmarking 8 threads ...
1764 c/s real, 231 c/s virtual (2047 hashes in 1.16 seconds)

So you can get to 1.02ms at 2 rounds, and to 0.74ms at 1 round, when
computing just 1 hash at a time on that CPU.  Also, as you can see, the
throughput figures don't differ as much, increasing by only 15% as you
go from 6 rounds to 1 round.  This is why I prefer 6.

The benchmarks above are without use of AVX2.  With AVX2 and 512-bit
(rather than 128-bit) pwxform S-box lookups (but still 8 KiB S-boxes):

PWXrounds = 6:

Benchmarking 1 thread ...
500 c/s real, 505 c/s virtual (511 hashes in 1.02 seconds)
Benchmarking 8 threads ...
1663 c/s real, 216 c/s virtual (3577 hashes in 2.15 seconds)

PWXrounds = 2:

Benchmarking 1 thread ...
1071 c/s real, 1077 c/s virtual (2047 hashes in 1.91 seconds)
Benchmarking 8 threads ...
1749 c/s real, 230 c/s virtual (2047 hashes in 1.17 seconds)

PWXrounds = 1:

Benchmarking 1 thread ...
1421 c/s real, 1421 c/s virtual (2047 hashes in 1.44 seconds)
Benchmarking 8 threads ...
1764 c/s real, 236 c/s virtual (2047 hashes in 1.16 seconds)

As you can see, we get under 1ms at 2 rounds then, and the difference
in throughput between 6 rounds and 1 round becomes even smaller (only
6%).  It can also be seen that AVX2 doesn't make a big difference, not
even when we've tuned the S-box lookup width.  (It makes more of a
difference for throughput at 2 MiB, though: 3400 vs. 4100.)  This is a
reason why I am primarily considering AVX2 and the wider S-box lookups
along with targeting L2 (rather than L1) cache with the S-boxes.
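The latency figures quoted so far follow directly from the
single-thread c/s numbers; a minimal C sketch of that conversion (the
c/s values are the single-thread 4 MiB results above, and treating
1/(c/s) as per-hash latency assumes only one hash is computed at a
time):

/* Convert single-thread "c/s real" throughput into per-hash latency.
 * The c/s figures are taken from the benchmarks quoted above. */
#include <stdio.h>

int main(void)
{
    const struct { const char *label; double cps; } runs[] = {
        { "no AVX2, PWXrounds = 2", 983.0 },
        { "no AVX2, PWXrounds = 1", 1346.0 },
        { "AVX2,    PWXrounds = 2", 1071.0 },
        { "AVX2,    PWXrounds = 1", 1421.0 },
    };

    for (size_t i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        printf("%s: %.2f ms/hash\n", runs[i].label,
            1000.0 / runs[i].cps);
    /* Prints roughly 1.02, 0.74, 0.93, and 0.70 ms, matching the
     * "1.02ms at 2 rounds" and "0.74ms at 1 round" figures above. */
    return 0;
}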
Here are the results for 64 KiB S-boxes (up from 8 KiB above):

PWXrounds = 6:

Benchmarking 1 thread ...
331 c/s real, 331 c/s virtual (511 hashes in 1.54 seconds)
Benchmarking 8 threads ...
1502 c/s real, 192 c/s virtual (1533 hashes in 1.02 seconds)

PWXrounds = 2:

Benchmarking 1 thread ...
695 c/s real, 695 c/s virtual (1023 hashes in 1.47 seconds)
Benchmarking 8 threads ...
1677 c/s real, 214 c/s virtual (3069 hashes in 1.83 seconds)

PWXrounds = 1:

Benchmarking 1 thread ...
938 c/s real, 947 c/s virtual (1023 hashes in 1.09 seconds)
Benchmarking 8 threads ...
1743 c/s real, 224 c/s virtual (3069 hashes in 1.76 seconds)

As you can see, it's almost the same throughput, and it is still
possible to get to around 1ms at 1 round, although the latency for
6 rounds is now up from 2ms to 3ms (but we really shouldn't care).

Out of these benchmarks, I like 64 KiB, 6 rounds best: 1502 c/s,
whereas the highest was 1764 c/s, so only 17.4% higher.  And who cares
about the 0.7ms vs. 3ms latency for the case when only 1 core is in
use?  Even if there's a low latency budget for a given deployment, it's
latency measured under high load (thus, with multiple cores in use)
that will determine whether a given set of parameters meets the budget
or not.

128 KiB S-boxes:

PWXrounds = 6:

Benchmarking 1 thread ...
288 c/s real, 288 c/s virtual (511 hashes in 1.77 seconds)
Benchmarking 8 threads ...
1299 c/s real, 166 c/s virtual (1533 hashes in 1.18 seconds)

256 KiB S-boxes:

PWXrounds = 6:

Benchmarking 1 thread ...
200 c/s real, 200 c/s virtual (255 hashes in 1.27 seconds)
Benchmarking 8 threads ...
1043 c/s real, 132 c/s virtual (1785 hashes in 1.71 seconds)

1299 is 86.5% of the 1502 that we had at 64 KiB.  1043 is 80.3% of
1299, and 69.4% of 1502.

128 KiB S-boxes means fully using the 256 KiB L2 caches when running
2 instances per core, like we do here.  256 KiB means we're exceeding
the L2 caches.

Alexander
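For cross-checking, a minimal C sketch of the ratio and S-box footprint
arithmetic from the end of the message (the c/s figures are the
8-thread, PWXrounds = 6 results above; the 256 KiB per-core L2 size and
the 2 instances per core correspond to the i7-4770K with
Hyper-Threading used for these benchmarks):

/* Re-derive the throughput ratios and per-core S-box footprints
 * discussed above. */
#include <stdio.h>

int main(void)
{
    const double cps_64k = 1502.0, cps_128k = 1299.0, cps_256k = 1043.0;

    printf("128 KiB vs. 64 KiB:  %.1f%%\n", 100.0 * cps_128k / cps_64k);
    printf("256 KiB vs. 128 KiB: %.1f%%\n", 100.0 * cps_256k / cps_128k);
    printf("256 KiB vs. 64 KiB:  %.1f%%\n", 100.0 * cps_256k / cps_64k);

    /* Per-core S-box footprint with 2 instances per core (SMT),
     * compared against the 256 KiB per-core L2 cache. */
    const unsigned l2_kib = 256;
    for (unsigned sbox_kib = 64; sbox_kib <= 256; sbox_kib *= 2)
        printf("%3u KiB S-boxes: %3u KiB per core (%s L2)\n",
            sbox_kib, 2 * sbox_kib,
            2 * sbox_kib <= l2_kib ? "fits in" : "exceeds");
    return 0;
}

This prints the 86.5%, 80.3%, and 69.4% figures given above, and shows
that 128 KiB S-boxes exactly fill the L2 cache at 2 instances per core
while 256 KiB S-boxes exceed it.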