lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <20150331125007.GA2715@openwall.com> Date: Tue, 31 Mar 2015 15:50:07 +0300 From: Solar Designer <solar@...nwall.com> To: discussions@...sword-hashing.net Subject: Re: [PHC] Argon2 Hi Dmitry, Thank you for your help reviewing yescrypt! On Tue, Mar 31, 2015 at 12:11:28PM +0200, Dmitry Khovratovich wrote: > To produce a 1 KB-block, > Argon2d applies the Blake2b round 16 times in total. 1 Blake2b round > is equivalent to 2 rounds of Salsa20, extended to 64-bit words. > > As I understand the yescrypt pseudocode, r=8 yields 1 KB blocks. > BlockMIX then applies pwxform r1=16 times, and Salsa20/8 one time. > > The latency of the compression function/block of Argon2d is 2 Blake2b > rounds. Since the compression function has full diffusion (every input > bit of 1 KB-block affects every output bit), any output byte has also > latency of 2 Blake2b rounds. > > In yescrypt, the SIMD lanes are mixed only in the Salsa20 call, which Agreed so far. > means that the block latency is not 16 pwxform calls+ 8 Salsa rounds, > but far smaller, only the equivalent of 4 pwxform+8 Salsa rounds. No. Where does the number 4 come from? Processing of every sub-block depends on the previous processed sub-block. > This is for the block; all the subblocks but the last one do not need > Salsa, Yes, typically most sub-blocks except for one or a few last ones (or exactly the last one with the provided compile-time settings, as you correctly say) do not need Salsa. > so their individual latencies are only 4 pwxform. No. They depend on processing of the same lanes in previous sub-blocks. At least that's the intent. If you found an error preventing this data dependency from working as intended, please let me know. > Thus indeed, yescrypt block function has higher latency, though not > that much higher compared to Argon2d (2-3x for most of the blocks). I disagree. I continue to think it's 16x more sequential rounds in yescrypt in Bill's test. > However, the yescrypt's BlockMIX is not memory-hard, as most of it can > be computed using only 16 bytes of RAM (not including the storage) by > computing it lanewise. This property might be mitigated by subblock > shuffling, but I have not analyzed it. Yes, it can be computed lanewise up until the final Salsa20/8, but that's a non-issue. In fact, this flexibility may be somewhat useful on register-starved archs. pwxform() in yescrypt-opt.c actually partly takes advantage of this: it has the loops re-ordered to be gather{rounds{simple}} rather than rounds{gather{simple}}, for some slight speedup on register-starved archs. yescrypt-opt.c does not take this to the BlockMix level, but it could. Do you think this has any drawback, and why? Yes, there is a local memory vs. computation tradeoff here, but the only place where yescrypt actually tries to depend on having much local memory is with pwxform's S-boxes. The current block isn't meant to be tied to local memory for sane performance. Also, its typical size is just 1 KB anyway, and it's accessed sequentially anyway (so its reads from global memory wouldn't be that slow), compared to the typical 8 KB for the S-boxes, which are accessed randomly (16-byte lookups by default). So this tradeoff doesn't appear lucrative to likely attackers, and it is allowing for relatively minor local memory needs reduction anyway (like to 8.5/9 with the typical numbers I gave here if paying for this with a 2x increase in computation time). What bothers me more is the simplification I made between yescrypt v0 and v1 in how the S-boxes are initialized. There's now an equivalent of classic scrypt's TMTO there, but applied to the S-boxes. My expectation is that this tradeoff is also usually not lucrative, but it isn't as non-lucrative as the above. For example, an attacker who reduced the local memory needs for the S-boxes from 8 KB to 4 KB by not storing every other 64-byte block (since they're initialized in such blocks) would incur a Salsa20/8 (yes, 8 rounds) per every other S-box lookup, of which there are two per pwxform round per gather lane. So for them one pwxform round per gather lane would effectively turn into Salsa20/8, on average. With 4 gather lanes per 64-byte sub-block (the current default) and just 1 pwxform round, that's 4x more work than classic scrypt did, and it's still on top of also needing 4 KB S-boxes in local memory. For 2 pwxform rounds as in Bill's benchmark, that's 8x more work than classic scrypt's. For 6 pwxform rounds as in the current PHC submission's defaults, that's 24x more work than classic scrypt's. That's just too much work (and latency) to do so frequently. So I accepted this simplicity vs. local memory TMTO resistance tradeoff in favor of simplicity in v1. But like I say, this does bother me a bit. > Do I understand yescrypt right? I think not. Perhaps I need to describe it better, including the expected latencies. > > That's cool, but that's 8192-bit parallelism, right? > > I would like to comment on this as well, but I do not quite understand > how you measure the parallelism. Maybe the compression function > picture in Argon2 specification helps. I was looking primarily at your code. I saw 16 BLAKE2 rounds, of which 2 groups of 8 had data dependencies between the groups. This appears to be consistent with what you're saying. Right? Thanks again, Alexander
Powered by blists - more mailing lists