lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <551DB31E.4000604@larc.usp.br> Date: Thu, 02 Apr 2015 18:22:38 -0300 From: Marcos Simplicio <mjunior@...c.usp.br> To: discussions@...sword-hashing.net Subject: Re: [PHC] Compute time hardness (pwxform,blake,blamka) Hi, there. I renamed the thread because it started to ramify since yesterday. See inline. BR, Marcos. On 01-Apr-15 16:49, Solar Designer wrote: > On Wed, Apr 01, 2015 at 03:25:49PM -0300, Marcos Antonio Simplicio Junior wrote: >>> De: "Solar Designer" <solar@...nwall.com> > [...] >>> What are the comparable numbers for Lyra2, POMELO, and Catena? I >>> guess for Lyra2 and Catena it's 8 ADDs + 8 XORs per 128 bytes, right? >> >> If the underlying sponge is Blake2b, I think that number is correct. > > And this applies to Lyra2 as benchmarked e.g. by Milan so far, correct? Yes: this corresponds to Milan's benchmarks. We have additional benchs on Lyra2's reference guide, although they correspond to the pre-tweak phase for every finalist other than Lyra2 (so, they are probably of little use except to compare different Lyra2 parameters). > >> With BlaMka, which replaces simple ADDs in the G function (e.g., a \gets a + b) by Latin Squares (a \gets a+b+2ab), that would be 8 MUL + 16 ADDs + 8 XORs (as usual, ignoring the shifts, and multiplication by 2, which is also a shift) per 128 bytes. >> >> Well, that assuming I understood the question correctly ... > > I think you did. Thank you! > > I assume you mean the case SPONGE == 1, where Lyra2 would use > ROUND_LYRA_BLAMKA, not SPONGE == 2 where it'd use HALF_ROUND_LYRA_BLAMKA? Exactly. "HALF_ROUND_LYRA_BLAMKA" is kind of a test code that we did not remove... we were trying to see how many rounds of BlaMKa would have approximately the same speed as Blake2b, and 0.5 was the answer (but it provides only intra-column diffusion, while 1-round BlaMka gives full diffusion just like 1-round Blake). BTW, does pwxform provide full diffusion (i.e., one bit ends up being dependent on all other bits) in an unbiased manner? It may not be critical here, but it is usually a "good to have" requirement. > > A minor detail: a+b+2ab can be written as (a+b)+2ab, giving a latency > of MUL + just one ADD, not two ADDs (a+b is computed in parallel with > 2ab). So it'd be 8 MUL + 8 ADD + 8 XOR. Right? We are discussing latency, so you are absolutely right! > > That's 8 MULs per 128 bytes, vs. current yescrypt's default of 6 MULs > per 64 bytes. So it's 1.5 times weaker. Does Lyra2 with BlaMka run > faster or slower than yescrypt (with pwxform settings as currently > specified) in terms of memory processed per second? It needs to run 1.5 > times faster to win this game. ;-) That is the part I wanted to have experimental results before giving an answer. They are in the attached images: - For 1 thread ("p1"), when both yescrypt and Lyra2 have the exact same number of passes through memory (Lyra2 with T=1 and yescrypt with T=2) and use an underlying hash with similar compute-time hardening (12 MULs, 12 ADDs and 12 XORs given by 1.5 BlaMka and pwxform) , then the lines almost overlap; so I believe we are near to a draw. - For 1 thread ("p1") with different passes through memory: say, yescrypt with T=0 does 4/3 passes though memory with 12 MUL+ADD+XOR, while Lyra2 with T=1 and BlaMka makes 2 passes through memory with 8 MUL+ADD+XOR are comparable (?) in terms of compute-time latency (4/3*12 = 2*8 = 16). Lyra2 in this case is slightly (3%) slower. Given the low difference, the reason is probably related to bandwidth usage, because Lyra2 purposely uses more bandwidth in this scenario: namely, it makes 2 writes per call of the underlying hash instead of 1, and 4 reads (2 from cache and 2 from memory) instead of 2. I'm not sure if that is good or bad, but improvements on memory bandwidth are usually slower than on computation power, so I would rather say it is good in terms of attack resistance, as it limits an attacker's parallelism ability. - For 2 threads ("p2"), Lyra2 is faster for similar compute-time hardening and also for higher compute-time hardening. That (and a very old e-mail by Bill) is why we think bandwidth is likely to be at play for p=1: we reduce the number of writes to 1 per thread during the Wandering phase when more threads are active. - For 4 threads ("p4"), Lyra2 is again faster for similar compute-time hardening, and also faster for higher compute-time hardening. > > BTW, you can take advantage of the (a+b)+2ab latency optimization in > your defensive code as well. It might cost an extra register and a move > instruction, though, so it isn't universally an improvement. It might > improve the latency on new CPUs (Ivy Bridge and Bulldozer, and newer) > and make it worse on older CPUs (Sandy Bridge and older when running > non-AVX builds). That seems worth trying (actually, we need to find someone with better implementation skills than ourselves: most of Lyra2's code tricks come from Blake2b's original implementation...). Thanks! > > Thanks, > > Alexander > Download attachment "Lyra2xYescrypt_p4.png" of type "image/png" (261353 bytes) Download attachment "Lyra2xYescrypt_p2.png" of type "image/png" (296018 bytes) Download attachment "Lyra2xYescrypt_p1.png" of type "image/png" (327616 bytes)
Powered by blists - more mailing lists