Date: Sat, 11 Jan 2014 22:50:53 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] escrypt memory access speed (Re: [PHC] Reworked KDF available on github for feedback: NOELKDF)

Bill,

On Sat, Jan 11, 2014 at 09:52:32PM +0400, Solar Designer wrote:
> On Sat, Jan 11, 2014 at 09:01:20PM +0400, Solar Designer wrote:
> > So r=32 (4 KB) appears optimal in this test.
> > 
> > r=32 and Salsa20 rounds count reduced to 1:
> > 
> > real    0m5.362s
> > user    0m39.046s
> > sys     0m2.588s
> > 
> > 2*3*10*2^30/10^9/5.362 = ~12 GB/s
[...]
> Turns out that with the settings above, the prefetches of to-be-written
> locations were no longer beneficial (they were with r=8 and 2+ rounds).
> Without them:
> 
> real    0m5.259s
> user    0m38.106s
> sys     0m2.684s

All of the above tests were on FX-8120 with 2x DDR3-1600.

Exact same source code recompiled without XOP (with pure AVX) and
running on i7-4770K with 4x DDR3-1600 (but only 2 memory channels, so
same 25.6 GB/s theoretical peak bandwidth):

real    0m4.310s
user    0m32.706s
sys     0m0.996s

2*3*10*2^30/10^9/4.310 = 14.95 GB/s
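(As an aside: the effective-bandwidth figures in this post can be reproduced with a small helper. This is just my arithmetic spelled out, taking the constant 2*3*10*2^30 from the formula above as the total bytes moved and dividing by wall-clock time.)

```python
# Effective bandwidth as computed throughout this post:
# total bytes moved = 2 * 3 * 10 * 2**30 (the constant factor in the
# formula above), divided by wall-clock ("real") time, in decimal GB/s.
BYTES_MOVED = 2 * 3 * 10 * 2**30

def effective_gbps(real_seconds):
    """Effective bandwidth in GB/s (10^9 bytes/s) for a given wall time."""
    return BYTES_MOVED / 1e9 / real_seconds

print(round(effective_gbps(4.310), 2))  # i7-4770K run above -> 14.95
```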

Same code on 2x E5-2670 with 8x DDR3-1600 ECC Reg (8 channels, so
102.4 GB/s theoretical peak bandwidth):

real    0m1.564s
user    0m49.006s
sys     0m0.390s

I did not limit the number of threads, and it appears OpenMP chose to
have extra threads spinning.  The test was set to p=8, so it couldn't
reasonably use more than 8 threads.  Anyway, this gives:

2*3*10*2^30/10^9/1.564 = 41.19 GB/s

With OMP_NUM_THREADS=8:

real    0m2.254s
user    0m17.327s
sys     0m0.514s

I don't get why limiting the number of threads to what the code can
actually use (with its p=8 setting) reduces performance on this test,
but both results are reproducible.  Anyway, this one corresponds to:

2*3*10*2^30/10^9/2.254 = 28.58 GB/s

All of the tests above produced the same hash values.  Now let's break
compatibility with previous test hashes by setting p=32.  Running that
with 8 threads gives:

real    0m2.290s
user    0m17.578s
sys     0m0.515s

Roughly same speed as for p=8.  That's good.  But now we can go for 32
threads for real:

real    0m1.293s
user    0m39.497s
sys     0m1.141s

Same hashes computed as in the test above, but much faster:

2*3*10*2^30/10^9/1.293 = 49.83 GB/s

So it appears that all 3 systems got to roughly half of theoretical peak:
~47% on the FX-8120 and 48.7% on the 2x E5-2670, with the i7-4770K doing
somewhat better at 14.95/25.6 = 58.4%.

On my sequential memory read benchmark with SSE2 instructions (unrolled
loop, plenty of parallelism, no computation), I am getting ~85 GB/s for
32 concurrent threads on the 2x E5-2670.  If we use that as the potential
speed, then 49.83 GB/s is 58.6% of the max.
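(The percent-of-peak figures above work out as follows; the reference peaks are the theoretical numbers and the ~85 GB/s measured read speed quoted in this post.)

```python
# (measured GB/s, reference peak GB/s) pairs from the runs above
runs = {
    "FX-8120 vs 25.6 GB/s theoretical":    (12.01, 25.6),
    "i7-4770K vs 25.6 GB/s theoretical":   (14.95, 25.6),
    "2x E5-2670 vs 102.4 GB/s theoretical": (49.83, 102.4),
    "2x E5-2670 vs ~85 GB/s measured":     (49.83, 85.0),
}
for name, (speed, peak) in runs.items():
    print(f"{name}: {100 * speed / peak:.1f}%")
```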

Back to 2 rounds of Salsa20:

real    0m1.316s
user    0m39.959s
sys     0m1.353s

2*3*10*2^30/10^9/1.316 = 48.95 GB/s

4 rounds:

real    0m1.412s
user    0m43.291s
sys     0m1.130s

2*3*10*2^30/10^9/1.412 = 45.63 GB/s

8 rounds:

real    0m1.694s
user    0m52.293s
sys     0m1.068s

2*3*10*2^30/10^9/1.694 = 38.03 GB/s

12 rounds:

real    0m2.117s
user    1m6.071s
sys     0m0.757s

2*3*10*2^30/10^9/2.117 = 30.43 GB/s

20 rounds:

real    0m3.233s
user    1m41.802s
sys     0m0.549s

2*3*10*2^30/10^9/3.233 = 19.93 GB/s

If we stay with pure Salsa20, then maybe 8, 4, or 2 rounds is optimal.
There's only a ~2% speedup from going further down to 1 round (49.83 vs.
48.95 GB/s).

In fact, if we were to regard only the Salsa20 rounds as computation
that the attacker will need to perform, then even round counts higher
than 8 could be better in attack area*time terms, from the defender's
point of view.  Since GB/s is proportional to memory usage at a fixed
running time (or, conversely, the memory usage achievable in a fixed
time is proportional to GB/s), we can estimate AT costs (in arbitrary
units) as follows:

48.95 GB/s * 2 rounds = 97.90
45.63 GB/s * 4 rounds = 182.52
38.03 GB/s * 8 rounds = 304.24
30.43 GB/s * 12 rounds = 365.16
19.93 GB/s * 20 rounds = 398.60
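(The table above is just the measured bandwidth multiplied by the round count, treating only the Salsa20 rounds as the attacker's computation, per the reasoning above:)

```python
# Estimated attack area*time cost in arbitrary units:
# measured bandwidth (proportional to memory filled in fixed time,
# i.e. "area") times the Salsa20 round count (the "time" factor).
measured = [(2, 48.95), (4, 45.63), (8, 38.03), (12, 30.43), (20, 19.93)]

for rounds, speed in measured:
    print(f"{speed} GB/s * {rounds} rounds = {speed * rounds:.2f}")
```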

However, in practice the "overhead" XORs and ADDs are probably no less
costly than those that are part of Salsa20 rounds, so 4 or 8 rounds may
be a good choice.  2 rounds does feel riskier: the computation is
significantly reduced, and the "time" factor could thus be similarly
reduced for an attacker with a more suitable hardware architecture.

Alexander
