[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150331125007.GA2715@openwall.com>
Date: Tue, 31 Mar 2015 15:50:07 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Argon2
Hi Dmitry,
Thank you for your help reviewing yescrypt!
On Tue, Mar 31, 2015 at 12:11:28PM +0200, Dmitry Khovratovich wrote:
> To produce a 1 KB-block,
> Argon2d applies the Blake2b round 16 times in total. 1 Blake2b round
> is equivalent to 2 rounds of Salsa20, extended to 64-bit words.
>
> As I understand the yescrypt pseudocode, r=8 yields 1 KB blocks.
> BlockMIX then applies pwxform r1=16 times, and Salsa20/8 one time.
>
> The latency of the compression function/block of Argon2d is 2 Blake2b
> rounds. Since the compression function has full diffusion (every input
> bit of 1 KB-block affects every output bit), any output byte has also
> latency of 2 Blake2b rounds.
>
> In yescrypt, the SIMD lanes are mixed only in the Salsa20 call, which
Agreed so far.
> means that the block latency is not 16 pwxform calls+ 8 Salsa rounds,
> but far smaller, only the equivalent of 4 pwxform+8 Salsa rounds.
No. Where does the number 4 come from?
Processing of every sub-block depends on the previous processed sub-block.
> This is for the block; all the subblocks but the last one do not need
> Salsa,
Yes, typically most sub-blocks except for one or a few last ones (or
exactly the last one with the provided compile-time settings, as you
correctly say) do not need Salsa.
> so their individual latencies are only 4 pwxform.
No. They depend on processing of the same lanes in previous sub-blocks.
At least that's the intent. If you found an error preventing this data
dependency from working as intended, please let me know.
> Thus indeed, yescrypt block function has higher latency, though not
> that much higher compared to Argon2d (2-3x for most of the blocks).
I disagree. I continue to think it's 16x more sequential rounds in
yescrypt in Bill's test.
> However, the yescrypt's BlockMIX is not memory-hard, as most of it can
> be computed using only 16 bytes of RAM (not including the storage) by
> computing it lanewise. This property might be mitigated by subblock
> shuffling, but I have not analyzed it.
Yes, it can be computed lanewise up until the final Salsa20/8, but
that's a non-issue. In fact, this flexibility may be somewhat useful on
register-starved archs. pwxform() in yescrypt-opt.c actually partly
takes advantage of this: it has the loops re-ordered to be
gather{rounds{simple}} rather than rounds{gather{simple}}, for some
slight speedup on register-starved archs. yescrypt-opt.c does not take
this to the BlockMix level, but it could.
Do you think this has any drawback, and why?
Yes, there is a local memory vs. computation tradeoff here, but the only
place where yescrypt actually tries to depend on having much local
memory is with pwxform's S-boxes. The current block isn't meant to be
tied to local memory for sane performance. Also, its typical size is
just 1 KB anyway, and it's accessed sequentially anyway (so its reads
from global memory wouldn't be that slow), compared to the typical 8 KB
for the S-boxes, which are accessed randomly (16-byte lookups by
default). So this tradeoff doesn't appear lucrative to likely
attackers, and it is allowing for relatively minor local memory needs
reduction anyway (like to 8.5/9 with the typical numbers I gave here if
paying for this with a 2x increase in computation time).
What bothers me more is the simplification I made between yescrypt v0
and v1 in how the S-boxes are initialized. There's now an equivalent of
classic scrypt's TMTO there, but applied to the S-boxes. My expectation
is that this tradeoff is also usually not lucrative, but it isn't as
non-lucrative as the above. For example, an attacker who reduced the
local memory needs for the S-boxes from 8 KB to 4 KB by not storing
every other 64-byte block (since they're initialized in such blocks)
would incur a Salsa20/8 (yes, 8 rounds) per every other S-box lookup, of
which there are two per pwxform round per gather lane. So for them one
pwxform round per gather lane would effectively turn into Salsa20/8, on
average. With 4 gather lanes per 64-byte sub-block (the current
default) and just 1 pwxform round, that's 4x more work than classic
scrypt did, and it's still on top of also needing 4 KB S-boxes in local
memory. For 2 pwxform rounds as in Bill's benchmark, that's 8x more
work than classic scrypt's. For 6 pwxform rounds as in the current PHC
submission's defaults, that's 24x more work than classic scrypt's.
That's just too much work (and latency) to do so frequently. So I
accepted this simplicity vs. local memory TMTO resistance tradeoff in
favor of simplicity in v1. But like I say, this does bother me a bit.
> Do I understand yescrypt right?
I think not. Perhaps I need to describe it better, including the
expected latencies.
> > That's cool, but that's 8192-bit parallelism, right?
>
> I would like to comment on this as well, but I do not quite understand
> how you measure the parallelism. Maybe the compression function
> picture in Argon2 specification helps.
I was looking primarily at your code. I saw 16 BLAKE2 rounds, of which
2 groups of 8 had data dependencies between the groups. This appears to
be consistent with what you're saying. Right?
Thanks again,
Alexander
Powered by blists - more mailing lists