lists.openwall.net lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC Open Source and information security mailing list archives
Date: Mon, 06 Apr 2015 13:09:36 -0300
From: Marcos Simplicio <mjunior@...c.usp.br>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Compute time hardness (pwxform,blake,blamka)

Hi again.

Apologies for the delay: I had almost no Internet access this weekend.

On 03-Apr-15 08:12, Solar Designer wrote:
> Hi Marcos,
>
> Thank you for running these benchmarks.  They're very helpful.
>
> On Thu, Apr 02, 2015 at 06:22:38PM -0300, Marcos Simplicio wrote:
>> - For 1 thread ("p1"), when both yescrypt and Lyra2 have the exact same
>> number of passes through memory (Lyra2 with T=1 and yescrypt with T=2)
>> and use an underlying hash with similar compute-time hardening (12 MULs,
>> 12 ADDs and 12 XORs given by 1.5 BlaMka and pwxform) , then the lines
>> almost overlap;  so I believe we are near to a draw.
>
> This is nice to know.  It's kind of a sanity check for both designs.
>
> There's a difference, though: I guess yescrypt's pwxform S-box lookups
> are much more frequent than Lyra2's equivalent (is there one?)

No, there is no table look-up or anything of the sort in any Blake
variant I'm aware of.

If so,
> yescrypt actually performs more attack-discouraging work in the same
> amount of time here.  Can you compare Lyra2 and yescrypt in this respect?

Well, I must confess I did not study pwxform in detail, so I'm not sure
I can provide a fair comparison (everything I said about pwxform so far
was based on what you said about it in terms of number of MULs :) ).

OTOH, what I can say is actually a question: the S-box lookups are
attack-discouraging when you look at GPUs, right? At least that is what
I grasped from previous discussions on pwxform:

- Multiplications are to limit speed-ups on ASICs, since CPUs have

- Look-up tables (LUTs) are GPU-unfriendly operations, at least in
theory, but are in principle very fast in hardware.

So far we were discussing hardware-oriented protection, so LUTs would
not count much in the comparison. Of course, I may be missing something!

BTW: we are still trying to verify the above assumptions on GPU- and
FPGA-oriented unfriendliness with actual experiments. The GPU graphs
should be ready shortly (we have a GPU specialist working on that),
while FPGAs should take a little longer (a colleague professor allocated
a student to work on that).

>
> Also, I never understood why you chose two full passes through memory as
> the minimum for Lyra2.  Is this arbitrary?  For yescrypt, the 4/3 can be
> shown as being the optimal stopping point assuming no AT-reducing TMTO,
> so it's not arbitrary at all.

OK, that is our fault: that should be clearer on our documentation...
The rationale is a little long, so please bear with me.

In regular executions with large amounts of memory, memory latency and
bandwidth are expected to (and, experimentally, they do) play an
important role. The result is that using much faster computation
hardware in attacks may end up being of little use: it will not
accelerate the attack by much, just like increasing the number of rounds
10 times in a legitimate platform does not slow down the execution 10
times either...

Now, attackers could combine faster computation hardware with faster
memory technology, thus taking full advantage of the former to
accelerate attacks, in which case computation hardening indeed becomes
an important requirement. However, using fast memory technology for a
PHS with a high memory usage (e.g., 1GB of high-speed registers) is
probably unrealistic, as it would become very expensive to test a single
more reasonable to buy more (less expensive) hardware and get a higher
password testing throughput simply with parallel tests.

On the other hand, using fast memory is very reasonable for accelerating
operations that involve only a small amount of memory. This is exactly
what a legitimate user's cache is for, so maximizing cache usage on the
legitimate platform is (at least from this perspective) a good approach:
attackers should have at least the same amount of cache to get the same
speed-ups, while having much more cache only helps if the data to be
processed is in that part of memory (e.g., if the required data can be
prefetched, something reasonable in deterministic, cache-timing
resistant schemes/passes, or, again, if the whole memory is a huge cache).

There is, however, another situation that involves only small amounts of
memory and, thus, in which attackers can benefit from faster memory
technology: recomputations during a TMTO attack. After all, whether the
scheme/pass is deterministic or password dependent, recomputations
become password-independent as long as the attacker stores the sequence
of rows required after seeing them in original computation; those rows
can, thus, be prefetched and stored in cache-like memory. As a result,
while the original computation of a block V_i (discarded) from a set of
blocks \phi{V_i} took, say, 20\sigma (where \sigma accounts both for
memory- and processing-related costs), a hardware acceleration of 10x
combined with a 10x faster memory could bring the cost of recomputations
to approximately 1 if \sigma is 50% memory and 50% processing;
parallelism and pipelining, if possible, may allow recomputations to
occur even faster. The lesson here is that, even though at first sight
the time*memory cost would appear to the around 20, in practice it might
be much lower than that. That is why computing the actual time*memory
cost is tricky, especially without experiments to verify them, and why I
believe TMTO resistance should involve a reasonable security margin
(something that Argon and Lyra2, for example, try to achieve with their
minimum settings). It is more or less like Rijndael adopting 10 rounds
even though the Wide Trail Strategy upon which it was constructed would
suggest 8 rounds as being enough (given that the 2 extra rounds may end
up adding 5 active S-Boxes to attacks, only 10% of the 50 provided by 8
rounds).

This desire for a security margin is what motivated Lyra2's t_min=1,
which can be considered "arbitrary" but not "random": it was based on
the security analysis we tried to provide together with the algorithm's
documentation. Based on what we have seen as effective attacks, I
believe that (and, again, that is open to debate!):

1) Running Lyra2 with a 1/3 pass after the memory is filled (i.e., with
"T=1/3") is probably not as TMTO-resistant as yescrypt with similar
settings (i.e., for "T=0"). The reason is that Lyra2's first pass (Setup
phase) is cache-timing resistant and, thus, more vulnerable to TMTO
attacks, while yescrypt's first pass trades cache-timing resistance for
better TMTO resistance. The only caveat here, and the reason why I
prefer to say "probably not", is that the extra writes done by Lyra2
during Setup may compensate at some extent this lower TMTO resistance.
If that means that Lyra2 with "T=1/3" is comparable with yescrypt with
"T=0", however, I cannot tell without further studies (and, preferably,
experimental results!).

2) If Lyra2 adopted a first pass that was not cache-timing resistant
(i.e., if the "row^0" indices were picked in a password dependent
manner, just like in the Wandering phase), then it would probably be
able to run with a lower value of T than yescrypt and still be as TMTO
resistant. The reason here is that Lyra2's extra writes and reads are
more likely to lead to recomputations in a TMTO attack. For example, if
the memory has 1/2 of all rows required (assuming they are all updated):
there is only a 25% chance that both row^0 and row^1 will be in memory,
so recomputations will occur 75% of the time; if recomputing any row
takes from 0 to \alpha (or \alpha/2 on average), recomputing two rows
(i.e., 25% of the time) takes 3\alpha/2 on average; at least 2 rows are
updated per call to the underlying sponge, so "keeping 1/2 of all rows
required (assuming they are all updated)" does not mean an actual memory
usage of 1/2, but that plus intermediary states. Yescrypt does have
similar features, but it does not try to be as TMTO resistant. Hence, a
"Lyra2d" (borrowing Argon2's notation to indicate a non-cache resistant
variant) in principle would achieve TMTO resistance faster than
yescrypt. Then, it *might* be fair to compare, say, Lyra2 with 7/6
passes through memory ("T=1/6") versus yescrypt with 4/3 passes through
memory ("T=0").  Again, that does not necessarily balance the scale
toward Lyra2, as it may very well be wasting cycles with memory-related
operations trying to provide TMTO resistance, when it could be less
TMTO-resistant per hashing operation and simply do more hashes.
Nevertheless, this design choice was taken after we observed that Lyra2
could achieve high speed even with good TMTO resistance, which also
means exploring a resource that has lower probability to be explored
better (e.g., with faster technology) by attackers than by legitimate
users: memory.

Without further experimentation, however, I do not feel comfortable
speculating further on potential drawbacks of each approach, though. I
can only say that we do pay a price in terms of performance trying to
provide some cache-timing resistance, while it might be better to go
with Argon2's approach of separating things in "d" and "i" (explaining
the difference to users, however, may not be very easy).

Anyhow, I do not think t_min is critical for any scheme during the
competition itself: a better understanding on attacks may motivate
authors to set it higher or lower, and the final decision of what should
be the minimum is probably something that needs to be discussed and
sanctioned by the PHC panel together with the authors and the community
in general. SHA-3's capacity is an example of such after-competition
effort: although it is probably a bad example due to the debatable
choices NIST tried to push, the idea of "fixing some knobs" after a
scheme was selected is not bad per se. And that is also why comparisons
with a similar number of passes through memory are useful, even if not
definitive to say which scheme is "the best".

>
> I currently recommend that yescrypt be used with t=0 unless more time
> can be spent and more memory can't be in a given case.
>

That is reasonable. I'm just not sure what is the "security margin" in
practice (again, because computing time*memory cost is
technology-dependent). Just to avoid misunderstandings: no FUD intended
here! Unless someone can come up with an effective attack, there is no
reason to think yescrypt's assumptions are wrong, so t=0 may be
perfectly safe.

>> - For 1 thread ("p1") with different passes through memory: say,
>> yescrypt with T=0 does 4/3 passes though memory with 12 MUL+ADD+XOR,
>> while Lyra2 with T=1 and BlaMka makes 2 passes through memory with 8
>> MUL+ADD+XOR are comparable (?) in terms of compute-time latency (4/3*12
>> = 2*8 = 16). Lyra2 in this case is slightly (3%) slower. Given the low
>> difference, the reason is probably related to bandwidth usage, because
>> Lyra2 purposely uses more bandwidth in this scenario: namely, it makes 2
>> writes per call of the underlying hash instead of 1, and 4 reads (2 from
>> cache and 2 from memory) instead of 2. I'm not sure if that is good or
>> bad, but improvements on memory bandwidth are usually slower than on
>> computation power, so I would rather say it is good in terms of attack
>> resistance, as it limits an attacker's parallelism ability.
>
> This makes sense, but you also need to factor in yescrypt's S-box lookups.

I agree this factor is likely to play a role. However, Bill reported a
long time ago speed-ups of ~20% if Lyra2 reduced its number of memory
writes when no parallelism is employed. We preferred not do so in the
1-threaded version, however, because we _wanted_ attackers to run into
memory bounds as soon as possible (again, because users and attackers

Nevertheless, that discussion and further tests motivated a bandwidth
reduction for higher p, which justifies the better results for p > 1:
the extra cores provide both the higher bandwidth usage we wanted and
similar TMTO security margins. The side effects are, however:

1) It does become memory bound with degrees of parallelism higher than ~4

2) More parallelism allows better performance of attacks using GPUs
(e.g., the highest throughput of GPU-based attacks reported in the
Reference Guide is for p = 4, the maximum p we tested)

All in all, it is quite possible that Lyra2 is not as good at using the
whole budget of processing cores as schemes with a lower bandwidth
usage, unless further tweaking is done (outside the competition, of course).

>
>> - For 2 threads ("p2"), Lyra2 is faster for similar compute-time
>> hardening and also for higher compute-time hardening. That (and a very
>> old e-mail by Bill) is why we think bandwidth is likely to be at play
>> for p=1: we reduce the number of writes to 1 per thread during the
>> Wandering phase when more threads are active.
>>
>> - For 4 threads ("p4"), Lyra2 is again faster for similar compute-time
>> hardening, and also faster for higher compute-time hardening.
>
> OK.  I expect yescrypt's S-box lookups come into play here.  In fact,
> they also affect the 1 thread case, so I'm lucky yescrypt didn't appear
> to run slower there.
>
>
>
> What machine is this, for these benchmarks?

Sorry: I forgot to mention... It is a Intel Xeon
E5-2430, with 48 GB of DRAM, running Ubuntu 14.04
LTS 64 bits

(Note: this is the same computer used for the benchmarks in our
reference guide. The number of physical cores is 6, however, not 12 as
reported there: 12 is the number of threads)

> Would you also benchmark Lyra2 vs. yescrypt for the maximum number of

Sure! Attached are the graphs for 1-14 threads (more than 12, just to
see what happens).

As mentioned above, Lyra2 gets memory bound after ~4 cores, so there is
little difference in performance between Blake2 or BlaMka with a
different number of rounds.

Yescrypt appears to scale better with more cores, apparently because it
does not get (much) memory-bound. We notice that the performance gains
seem to be more expressive up to the point the maximum number of
physical cores is reached, although it may be a coincidence.

One question here is: does yescrypt design tries to make attackers
memory-bound too? I mean, it clearly takes better advantage of the
"processing cores" budget, but cores are something that attackers are
likely to have access more easily than defenders, while "better memory
technology" not so much. This logic may be completely wrong (especially
without experimental results to back it up), but, for better or for
worse, that is something we considered when designing Lyra2.

> Are both Lyra2 and yescrypt
> built for the same SIMD intrinsics (and which)?

Not really: Lyra2 uses SSE2 ("-msse2" flag), while yescrypt was built
with the default makefile, with its "-march=native" flag. So, in the
words of my student who run the benchmarks, "yescrypt is running SSE4.1
and/or AVX".

I will have to check in more detail to avoid saying anything wrong, but
it appears we are running yescrypt with more help from SIMD instructions
than Lyra2...

BTW, we have not yet used the suggestion by Samuel Neves regarding NORX
tricks ("We used this trick (replacing + with ^, * with &) in NORX (pg.
23 of https://norx.io/data/norx.pdf); it does noticeably
improve latency."). We will try and see what happens.

>
> There's significant room for improvement for yescrypt with AVX2, which I
> haven't added yet.  In fact, on pre-AVX2 CPUs the compute hardening can
> be doubled by reducing the parallelism from 512-bit to 256-bit, with
> only a moderate slowdown, so the hardening per defensive running time
> would likely be improved by 50% or so - but I preferred the defaults to
> be balanced, with newer and near future CPUs in consideration as well,
> and also with CPU attacks considered (where an attacker could bring the
> extra parallelism in from having multiple candidate passwords to test).

That sounds reasonable. One caveat, though, is that if we really are
going to take into account more limited platforms (e.g., 8- or 16-bit
microcontrollers) as a possible use case, then fixing the underlying
hash to pwxform may not be the best approach (possibly a "pwxformLite"
would be necessary).

>
> Thanks again,
>
> Alexander
>