phc-discussions - Re: [PHC] Argon2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 31 Mar 2015 15:50:07 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Argon2

Hi Dmitry,

Thank you for your help reviewing yescrypt!

On Tue, Mar 31, 2015 at 12:11:28PM +0200, Dmitry Khovratovich wrote:
> To produce a 1 KB-block,
> Argon2d applies the Blake2b round 16 times in total. 1 Blake2b round
> is equivalent to 2 rounds of Salsa20, extended to 64-bit words.
> 
> As I understand the yescrypt pseudocode, r=8 yields 1 KB blocks.
> BlockMIX then applies pwxform r1=16 times, and Salsa20/8 one time.
> 
> The latency of the compression function/block of Argon2d is 2 Blake2b
> rounds. Since the compression function has full diffusion (every input
> bit of 1 KB-block affects every output bit), any output byte has also
> latency of 2 Blake2b rounds.
> 
> In yescrypt, the SIMD lanes are mixed only in the Salsa20 call, which

Agreed so far.

> means that the block latency is not 16 pwxform calls+ 8 Salsa rounds,
> but far smaller, only the equivalent of 4 pwxform+8 Salsa rounds.

No.  Where does the number 4 come from?

Processing of every sub-block depends on the previous processed sub-block.

> This is for the block; all the subblocks but the last one do not need
> Salsa,

Yes, typically most sub-blocks except for one or a few last ones (or
exactly the last one with the provided compile-time settings, as you
correctly say) do not need Salsa.

> so their individual latencies are only 4 pwxform.

No.  They depend on processing of the same lanes in previous sub-blocks.

At least that's the intent.  If you found an error preventing this data
dependency from working as intended, please let me know.

> Thus indeed, yescrypt block function has higher latency, though not
> that much higher compared to Argon2d (2-3x for most of the blocks).

I disagree.  I continue to think it's 16x more sequential rounds in
yescrypt in Bill's test.

> However, the yescrypt's BlockMIX is not memory-hard, as most of it can
> be computed using only 16 bytes of RAM (not including the storage) by
> computing it lanewise. This property might be mitigated by subblock
> shuffling, but I have not analyzed it.

Yes, it can be computed lanewise up until the final Salsa20/8, but
that's a non-issue.  In fact, this flexibility may be somewhat useful on
register-starved archs.  pwxform() in yescrypt-opt.c actually partly
takes advantage of this: it has the loops re-ordered to be
gather{rounds{simple}} rather than rounds{gather{simple}}, for some
slight speedup on register-starved archs.  yescrypt-opt.c does not take
this to the BlockMix level, but it could.

Do you think this has any drawback, and why?

Yes, there is a local memory vs. computation tradeoff here, but the only
place where yescrypt actually tries to depend on having much local
memory is with pwxform's S-boxes.  The current block isn't meant to be
tied to local memory for sane performance.  Also, its typical size is
just 1 KB anyway, and it's accessed sequentially anyway (so its reads
from global memory wouldn't be that slow), compared to the typical 8 KB
for the S-boxes, which are accessed randomly (16-byte lookups by
default).  So this tradeoff doesn't appear lucrative to likely
attackers, and it is allowing for relatively minor local memory needs
reduction anyway (like to 8.5/9 with the typical numbers I gave here if
paying for this with a 2x increase in computation time).

What bothers me more is the simplification I made between yescrypt v0
and v1 in how the S-boxes are initialized.  There's now an equivalent of
classic scrypt's TMTO there, but applied to the S-boxes.  My expectation
is that this tradeoff is also usually not lucrative, but it isn't as
non-lucrative as the above.  For example, an attacker who reduced the
local memory needs for the S-boxes from 8 KB to 4 KB by not storing
every other 64-byte block (since they're initialized in such blocks)
would incur a Salsa20/8 (yes, 8 rounds) per every other S-box lookup, of
which there are two per pwxform round per gather lane.  So for them one
pwxform round per gather lane would effectively turn into Salsa20/8, on
average.  With 4 gather lanes per 64-byte sub-block (the current
default) and just 1 pwxform round, that's 4x more work than classic
scrypt did, and it's still on top of also needing 4 KB S-boxes in local
memory.  For 2 pwxform rounds as in Bill's benchmark, that's 8x more
work than classic scrypt's.  For 6 pwxform rounds as in the current PHC
submission's defaults, that's 24x more work than classic scrypt's.
That's just too much work (and latency) to do so frequently.  So I
accepted this simplicity vs. local memory TMTO resistance tradeoff in
favor of simplicity in v1.  But like I say, this does bother me a bit.

> Do I understand yescrypt right?

I think not.  Perhaps I need to describe it better, including the
expected latencies.

> > That's cool, but that's 8192-bit parallelism, right?
> 
> I would like to comment on this as well, but I do not quite understand
> how you measure the parallelism. Maybe the compression function
> picture in Argon2 specification helps.

I was looking primarily at your code.  I saw 16 BLAKE2 rounds, of which
2 groups of 8 had data dependencies between the groups.  This appears to
be consistent with what you're saying.  Right?

Thanks again,

Alexander