phc-discussions - Re: [PHC] escrypt memory access speed (Re: [PHC] Reworked KDF available on github for feedback: NOELKDF)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date: Sat, 11 Jan 2014 23:23:17 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] escrypt memory access speed (Re: [PHC] Reworked KDF available on github for feedback: NOELKDF)

Bill,

On Sat, Jan 11, 2014 at 10:50:53PM +0400, Solar Designer wrote:
> I don't get why limiting the number of threads to what the code can
> actually use (with its p=8 setting) reduces performance on this test,

Got it: I totally forgot that I had another 32-thread job running on the
machine, at the lowest priority (nice +19).  So when I ran only 8 new
threads, they shared cores with that other job, whereas when I ran all
32 new threads, they almost fully replaced the other job (apparently,
OpenMP's spinning threads were taking up less resources than that other
job's actual computation threads).

With that other job stopped, I get the following numbers for 1 Salsa20
round, p=8, and only 8 running threads:

real    0m1.483s
user    0m11.398s
sys     0m0.383s

2*3*10*2^30/10^9/1.483 = 43.44 GB/s

Ditto, 2 rounds:

real    0m1.723s
user    0m13.331s
sys     0m0.370s

2*3*10*2^30/10^9/1.723 = 37.39 GB/s

4 rounds:

real    0m2.485s
user    0m19.404s
sys     0m0.383s

2*3*10*2^30/10^9/2.485 = 25.93 GB/s

8 rounds:

real    0m4.256s
user    0m33.547s
sys     0m0.367s

2*3*10*2^30/10^9/4.293 = 15.01 GB/s

So it appears that on the idle machine (as it should have been), 8
threads are almost enough to use the 8 memory channels, but not enough
to reach similar GB/s speeds when Salsa20 rounds count is increased.

> Roughly same speed as for p=8.  That's good.  But now we can go for 32
> threads for real:
> 
> real    0m1.293s
> user    0m39.497s
> sys     0m1.141s

New speed for 1 Salsa20 round and p=32, on idle machine:

real    0m1.286s
user    0m39.523s
sys     0m1.183s

Not much change here. :-)

2*3*10*2^30/10^9/1.286 = 50.10 GB/s

> Back to 2 rounds of Salsa20:
> 
> real    0m1.316s
> user    0m39.959s
> sys     0m1.353s
> 
> 2*3*10*2^30/10^9/1.316 = 48.95 GB/s

real    0m1.305s
user    0m40.109s
sys     0m1.242s

49.37 GB/s

> 4 rounds:
> 
> real    0m1.412s
> user    0m43.291s
> sys     0m1.130s
> 
> 2*3*10*2^30/10^9/1.412 = 45.63 GB/s

real    0m1.404s
user    0m43.245s
sys     0m1.229s

45.89 GB/s

> 8 rounds:
> 
> real    0m1.694s
> user    0m52.293s
> sys     0m1.068s
> 
> 2*3*10*2^30/10^9/1.694 = 38.03 GB/s

real    0m1.684s
user    0m52.329s
sys     0m1.109s

38.26 GB/s

So the same conclusions still apply:

> If we stay with pure Salsa20, then maybe 8, 4, or 2 rounds is optimal.
> There's only a 2% speedup from going further to 1 round.
> 
> In fact, if we were to regard only the Salsa20 rounds as computation
> that the attacker will need to perform, then even round counts higher
> than 8 could be more optimal in attack area*time terms, from defender's
> point of view.  Since GB/s is proportional to memory usage for fixed
> running time (or actually the other way around), we can estimate AT
> costs (in arbitrary units) as follows:
> 
> 48.95 GB/s * 2 rounds = 97.90
> 45.63 GB/s * 4 rounds = 182.52
> 38.03 GB/s * 8 rounds = 304.24
> 30.43 GB/s * 12 rounds = 365.16
> 19.93 GB/s * 20 rounds = 398.60

... but these are with all cores in use.  With only 8 threads running,
lower Salsa20 round counts are more beneficial.

> However, in practice the "overhead" XORs and ADDs are probably no less
> costly than those that are part of Salsa20 rounds, so 4 or 8 rounds may
> be good.  2 rounds does feel potentially worse in terms of significantly
> reduced computation (and thus potentially similarly reduced "time"
> factor for the attacker with a more suitable hardware architecture).

Alexander