phc-discussions - Re: [PHC] escrypt memory access speed (Re: [PHC] Reworked KDF available on github for feedback: NOELKDF)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date: Sat, 11 Jan 2014 16:42:10 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] escrypt memory access speed (Re: [PHC] Reworked KDF
 available on github for feedback: NOELKDF)

On Sat, Jan 11, 2014 at 2:23 PM, Solar Designer <solar@...nwall.com> wrote:
> New speed for 1 Salsa20 round and p=32, on idle machine:
>
> real    0m1.286s
> user    0m39.523s
> sys     0m1.183s
>
> Not much change here. :-)
>
> 2*3*10*2^30/10^9/1.286 = 50.10 GB/s

Holy cow!  That's serious speed, dude.  The best graphics, FPGA, or
ASIC attack could not do more than 4-5X better.

>> In fact, if we were to regard only the Salsa20 rounds as computation
>> that the attacker will need to perform, then even round counts higher
>> than 8 could be more optimal in attack area*time terms, from defender's
>> point of view.  Since GB/s is proportional to memory usage for fixed
>> running time (or actually the other way around), we can estimate AT
>> costs (in arbitrary units) as follows:
>>
>> 48.95 GB/s * 2 rounds = 97.90
>> 45.63 GB/s * 4 rounds = 182.52
>> 38.03 GB/s * 8 rounds = 304.24
>> 30.43 GB/s * 12 rounds = 365.16
>> 19.93 GB/s * 20 rounds = 398.60

Imagine that all the lines of code you've written in C for escrypt,
other than for memory access, will cost an attacker maybe $0.02 per
core, and that Salsa20/20 will run in maybe 2 clocks instead of 1 for
Salsa20/1.  This is roughly the case, within maybe 2X in cost and
clock cycles.  It's pointless trying to make an ASIC attacker's
arithmetic cost significant in either time or money, so we have to
focus on memory size and bandwidth.  For defense against ASIC attacks,
your Salsa20/1 benchmark is the best.  50GB/s!  Nice!

Bill

On Sat, Jan 11, 2014 at 2:23 PM, Solar Designer <solar@...nwall.com> wrote:
> Bill,
>
> On Sat, Jan 11, 2014 at 10:50:53PM +0400, Solar Designer wrote:
>> I don't get why limiting the number of threads to what the code can
>> actually use (with its p=8 setting) reduces performance on this test,
>
> Got it: I totally forgot that I had another 32-thread job running on the
> machine, at the lowest priority (nice +19).  So when I ran only 8 new
> threads, they shared cores with that other job, whereas when I ran all
> 32 new threads, they almost fully replaced the other job (apparently,
> OpenMP's spinning threads were taking up less resources than that other
> job's actual computation threads).
>
> With that other job stopped, I get the following numbers for 1 Salsa20
> round, p=8, and only 8 running threads:
>
> real    0m1.483s
> user    0m11.398s
> sys     0m0.383s
>
> 2*3*10*2^30/10^9/1.483 = 43.44 GB/s
>
> Ditto, 2 rounds:
>
> real    0m1.723s
> user    0m13.331s
> sys     0m0.370s
>
> 2*3*10*2^30/10^9/1.723 = 37.39 GB/s
>
> 4 rounds:
>
> real    0m2.485s
> user    0m19.404s
> sys     0m0.383s
>
> 2*3*10*2^30/10^9/2.485 = 25.93 GB/s
>
> 8 rounds:
>
> real    0m4.256s
> user    0m33.547s
> sys     0m0.367s
>
> 2*3*10*2^30/10^9/4.293 = 15.01 GB/s
>
> So it appears that on the idle machine (as it should have been), 8
> threads are almost enough to use the 8 memory channels, but not enough
> to reach similar GB/s speeds when Salsa20 rounds count is increased.
>
>> Roughly same speed as for p=8.  That's good.  But now we can go for 32
>> threads for real:
>>
>> real    0m1.293s
>> user    0m39.497s
>> sys     0m1.141s
>
> New speed for 1 Salsa20 round and p=32, on idle machine:
>
> real    0m1.286s
> user    0m39.523s
> sys     0m1.183s
>
> Not much change here. :-)
>
> 2*3*10*2^30/10^9/1.286 = 50.10 GB/s
>
>> Back to 2 rounds of Salsa20:
>>
>> real    0m1.316s
>> user    0m39.959s
>> sys     0m1.353s
>>
>> 2*3*10*2^30/10^9/1.316 = 48.95 GB/s
>
> real    0m1.305s
> user    0m40.109s
> sys     0m1.242s
>
> 49.37 GB/s
>
>> 4 rounds:
>>
>> real    0m1.412s
>> user    0m43.291s
>> sys     0m1.130s
>>
>> 2*3*10*2^30/10^9/1.412 = 45.63 GB/s
>
> real    0m1.404s
> user    0m43.245s
> sys     0m1.229s
>
> 45.89 GB/s
>
>> 8 rounds:
>>
>> real    0m1.694s
>> user    0m52.293s
>> sys     0m1.068s
>>
>> 2*3*10*2^30/10^9/1.694 = 38.03 GB/s
>
> real    0m1.684s
> user    0m52.329s
> sys     0m1.109s
>
> 38.26 GB/s
>
> So the same conclusions still apply:
>
>> If we stay with pure Salsa20, then maybe 8, 4, or 2 rounds is optimal.
>> There's only a 2% speedup from going further to 1 round.
>>
>> In fact, if we were to regard only the Salsa20 rounds as computation
>> that the attacker will need to perform, then even round counts higher
>> than 8 could be more optimal in attack area*time terms, from defender's
>> point of view.  Since GB/s is proportional to memory usage for fixed
>> running time (or actually the other way around), we can estimate AT
>> costs (in arbitrary units) as follows:
>>
>> 48.95 GB/s * 2 rounds = 97.90
>> 45.63 GB/s * 4 rounds = 182.52
>> 38.03 GB/s * 8 rounds = 304.24
>> 30.43 GB/s * 12 rounds = 365.16
>> 19.93 GB/s * 20 rounds = 398.60
>
> ... but these are with all cores in use.  With only 8 threads running,
> lower Salsa20 round counts are more beneficial.
>
>> However, in practice the "overhead" XORs and ADDs are probably no less
>> costly than those that are part of Salsa20 rounds, so 4 or 8 rounds may
>> be good.  2 rounds does feel potentially worse in terms of significantly
>> reduced computation (and thus potentially similarly reduced "time"
>> factor for the attacker with a more suitable hardware architecture).
>
> Alexander