phc-discussions - Re: [PHC] "Attack on the iterative compression function"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150416010117.GB13118@openwall.com>
Date: Thu, 16 Apr 2015 04:01:17 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] "Attack on the iterative compression function"

Hi Dmitry,

Thank you very much for working on this!

On Wed, Apr 15, 2015 at 12:43:35PM +0200, Dmitry Khovratovich wrote:
> your original interpretation is correct. The idea is that even though
> the computation latency may be high, the recomputation latency can be
> very low if the attacker stores appropriate intermediate values. This
> applies to Lyra2 and other iterative designs.
> 
> I guess Percival felt there is an attack of this kind, that is why he
> introduced this subblock shuffling. The authors of Lyra2 did the same
> after they have seen our attack.This countermeasure helps a bit, but
> for small recomputation depth it is not good enough (it is my
> feeling), and it adds complexity.

Yes, now I can see how the shuffling may affect this.  Original scrypt's
sub-block shuffling appears to double the latency of your attack at each
level of the recomputation tree.  Correct?

I guess original scrypt's sub-block shuffling is defined in the exact
way it is in order to ensure that two successive applications of the
shuffling don't cancel each other.  However, I don't immediately see a
specific reason to have this requirement, and from the point of view of
maximizing latency in your attack, simple reverse order of sub-blocks
(e.g., "B_15 xor B_14", "B_14 xor B_13", etc. written to new B_0, B_1,
etc. instead of the current "B_15 xor B_0", "B_0 xor B_1", etc.) feels
like it'd work best.  I haven't thought this through yet.  I'd
appreciate your thoughts on the reverse order of sub-blocks idea.

> Our approach is to avoid this
> situation for all, requiring that there simply must not be any
> tradeoff attack on the compression function, or, in other words, the
> computational latency must be tradeoff-resistant. The Argon2
> parallelism that you have mentioned is somewhat unavoidable if we want
> this resistance.

Yes, I figured this out yesterday.

I think it's avoidable, but a possible reduction of parallelism just
does not occur with your approach naturally, and it'd cost a few extra
XORs or such to introduce it.  Looking at it from a different angle,
maybe this is your opportunity to easily introduce tunable
instruction-level parallelism (have a couple of flags dictate whether to
introduce or skip those extra data dependencies).

> The attack, apparently, also applies to yescrypt. I was not sure about
> that until you released the pseudo-code, but then I started thinking
> about storing the last subblock of each block to break the chain. For
> the full attack I would still have to plug yescrypt mode of operation
> (and its addressing) into our ranking method.
> 
> Your question prompted me to finish the attack, write a small
> description, and adapt the ranking algorithm to yescrypt. Here is a
> fresh draft:
> https://www.cryptolux.org/images/0/0f/Tradeoff_for_Yescrypt.pdf

I confirm your analysis.

I assume the C(q) and D(q) figures are currently inexact, for the reason
you mention ("have to plug yescrypt mode of operation (and its
addressing) into our ranking method"), but they look sane to me, and the
combination with your attack on BlockMix_pwxform looks correct to me.

I hope that yescrypt's D(q) will turn out to be slightly higher than
what you included here once you consider yescrypt's Wrap().  I guess
your current C(q) and D(q) assume modulo division in place of Wrap()?

You make two observations: about late inter-lane diffusion, and about
recovering missing blocks subblockwise.  It appears that you currently
only make use of the latter observation.  Do you think the former is of
any use in an attack?

> The consequences are better than for the original Lyra2 due to rather
> small number of subblocks and different addressing. Assuming unlimited
> parallelism, the time-memory product can be kept the same while the
> memory is reduced by the factor of 6.

Right.  At least that's what we arrive at with the current D(q) estimates.

I would also be interested in your results for t=1 (5/6 passes) and t=2
(2 passes).

>> However, the attack would appear to rely not only on the last
>> sub-blocks, but also on occasional full blocks, so I don't see how it'd
>> improve upon the higher-level TMTO attack alone.

I now see that it works by reducing the recomputation depth, not the
total recomputation, nor the memory, of a higher-level TMTO attack.

Thanks again,

Alexander