phc-discussions - Re: [PHC] A review per day

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <540CBA61.7080904@ciphershed.org>
Date: Sun, 07 Sep 2014 16:04:49 -0400
From: Bill Cox <waywardgeek@...hershed.org>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] A review per day - Lyra2

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 09/07/2014 11:05 AM, Solar Designer wrote:
> Bill,
> 
> Thank you for comparing these three candidates.
> 
> On Sat, Sep 06, 2014 at 10:44:21AM -0400, Bill Cox wrote:
>> Yescrypt wins at GPU defense, due to it's rapid unpredictable 
>> small memory reads from L1 cache.  Like bcrypt, Yescrypt issues 
>> multiple unpredictable small memory reads in parallel, to get to 
>> about 1/2 the number that bcrypt does
> 
> Where did you get the specific "1/2" figure from?  Does it
> possibly come from that posting on "escrypt 0.3.1"? - 
> http://thread.gmane.org/gmane.comp.security.phc/959/focus=1009 If 
> so, please note that current yescrypt, as submitted to PHC, has 
> changed in this respect since then, to match bcrypt more closely
> in this respect.  In that posting, I specifically said that those 
> initial benchmark results vs. bcrypt demonstrated that more work 
> was needed.

I did use old post data.  Perhaps I should read the papers or go back
and re-read your code :-)  However, it's easier just to ask you.

How fast is the latest Yescrypt vs bcrypt at small random reads?  I
estimated you were doing 3X more than TwoCats when maxed-out, though
this depends on the memory in use.  For bandwidth limited external
hashes, running Yescrypt with t_cost == 0, Yescrypt would have about a
thread per CPU, while TwoCats would only be running on about half the
CPUs.  When running very small hashes (8KiB), how fast does Yescrypt
do unpredictable reads?

>> when single-threaded.
> 
> What do you mean by that?
> 
> The comparison vs. bcrypt only makes sense when we're exhausting 
> all CPU cores by both bcrypt and the PHC candidate being tested. 
> This means running as many concurrent instances (yes, 
> single-threaded ones) as there are logical CPUs in the system (so 
> multiple threads total).

I tried to compare when running 1 thread per CPU.  However, I could
use your help with the numbers.

> Purely single-threaded comparisons are not relevant, neither for 
> the defender (where the bcrypt cost setting and the PHC
> candidate's cost settings will be limited by server capacity), nor
> for the attacker (who will typically want to use their resources
> fully at all times).

Agreed.  Mostly, anyway.  Single threaded defense is an important
case, which I listed in the table.  I meant to list by order of
importance, and this is not high on the list.

>> TwoCats only issues one unpredictable read at a time, and 
>> performs them at about 1/3 the number as bcrypt when single 
>> threaded.  However, Yescrypt is designed to max out everything
>> at the same time, including CPU cores, while TwoCats runs into 
>> memory bottlenecks on my test machine using only half the CPU 
>> cores.  This gives Yescrypt a 3X-ish advantage over TwoCats in 
>> unpredictable read rate, which likely translates into better GPU 
>> defense.
> 
> Oh, OK.  So you do acknowledge the single-threaded scenario isn't 
> one to consider in this context.

Yes.

>> Lyra2 does small-ish unpredictable reads,
> 
> Where?  I am not seeing that.  I only see 3 nested loops: Time Loop
> until T (corresponds to yescrypt's t), Visitation Loop for rows
> (corresponds to (ye)scrypt SMix second loop), and Columns Loop
> (corresponds to (ye)scrypt's BlockMix).  There's nothing that would
> correspond to yescrypt's pwxform, as far as I can see. Setting C
> low would be equivalent to setting scrypt's r low - that is, it
> would bring things closer to a Pufferfish-alike, but not close
> enough for this to make sense (so memory access would become slow,
> but not much GPU defense would be achieved).

Whoa!  I totally missed that!  I did not re-read Lyra2's code well
enough this time.  I thought I remembered it well enough from last
Spring.  The loop for each row is a simple linear pass, with no small
random reads at all.

This is worse GPU defense than I thought.

>> ASIC defense for large memory hashing is excellent for all three,
>> assuming Lyra2 becomes multi-threaded.  An ASIC attacker will
>> quickly become memory bandwidth limited.  Practical ASIC attacks
>> have at most about a 32 times higher memory bandwidth than a
>> modern high-end CPU. This enables an attacker to run up to 32
>> times faster, but the expense per ASIC of such a system is likely
>> similar to the cost of a desktop PC.  TwoCats and Yescrypt have
>> multiplication chain compute-time hardening, which in some cases
>> will limit an attacker's ability to hash at the full speed 
>> supported by his memory bandwidth.  In that case, Yescrypt and 
>> TwoCats will force an attacker to either make fewer guesses per 
>> second, or buy more memory so that more attacks can be 
>> interleaved, filling up memory bandwidth.  Yescrypt and TwoCats 
>> win over Lyra2 because of this.  TwoCat's multiplication chains 
>> are tunable, allowing it to tune the multiplication chain
>> lengths to the hashing speed of the machine.  This allows TwoCats
>> to be more multiplication chain hardened than Yescrypt in some
>> cases. The extra area on the ASIC needed by Lyra2 and Yescrypt 
>> computation logic vs TwoCats is tiny, as there would only be 32 
>> copies at most.  TwoCats has a slight win here over Yescrypt.
> 
> I intend to look into making yescrypt's pwxform rounds count 
> tunable, e.g. maybe supporting 2, 6, 10, 14 rounds (two flag
> bits). (The current hard-coded default is 6.)  In semi-optimized
> code, this may be 2 rounds followed by a loop with 4 rounds in it,
> or vice versa, or it may be a loop with 4 rounds in it and an exit 
> from the middle of the loop's body. 6 is what I find optimal on 
> current servers for user authentication, 2 may provide better 
> single-threaded performance (but lower multiplication latency 
> hardening), closer to TwoCats'.  10 and 14 may provide higher 
> multiplication latency hardening.
> 
> Do you feel this would be worth the extra complexity?

I am not sure.  Complexity is already Yescrypt's biggest weakness,
IMO.  On the other hand, lower depth might also be good for basic CPUs
with no SIMD support.

Honestly, Yescrypt already wins in most of the most important
categories, and loses only slightly in most of the others.  I don't
see any significant weaknesses until we get down to Basic CPU support
and cache-timing defense.  IMO, it's already the winning entry.

However, for TrueCrpyt hashing, I don't care much about cache timing.
 I think I care moure about lower-end CPUs.  Low hashing rounds would
help basic CPUs, wouldn't it?

>> It may be common for some groups of users to run single-threaded,
>> though I recommend against it in all cases except for single-core
>> CPUs.  In that case, TwoCats wins in ASIC defense simply because
>> it will fill more memory than either Lyra2 or Yescrypt.  A single
>> TwoCats thread fills external memory over twice the speed of
>> either Yescrypt or Lyra2.  When no SIMD unit is available, that
>> ratio might increases by a factor of about 4, corresponding to
>> the pipeline depth used in the SIMD units by Yescrypt and Lyra2.
> 
>> While all three have no requirement for special hardware, all 
>> three are tuned for modern x86 SIMD units.  TwoCats I think wins 
>> in adaptability to basic CPUs.  This is because it can run with
>> a single 32-bit width with far fewer instructions per byte of 
>> memory hashed. Yescrypt also has variable width SIMD loops, but 
>> it's pipeline depth is not currently variable, making it slow 
>> down somewhat on a basic CPU.  TwoCats wins here because it uses 
>> less pipelined parallelism.
> 
> Yes, I need to make yescrypt's instruction and SIMD parallelism 
> runtime tunable, supporting a handful of settings.

Maybe... At some point you may need to ask the opinions of the other
judges.  Yescrypt is already a defensive wall with more layers than
any other entry by a wide margin.  At some point it becomes overkill.

What I'm wondering is whether a basic non-SIMD single-core CPU would
use any of these entries with well tuned parameters, or if they would
just run with defaults.  My guess is they would set t_cost to 0 or 1,
and lower m_cost until runtime were OK.  Most app designers would not
go to the extended API and tune the parallelism to their CPU.
However, for those that do, I think they would like both SIMD width
and pipeline depth control.

Frankly, you have answers for these issues, but it's not clear you
need to deal with them.  I think you're algorithm is already the clear
winner.  Just take a look at my spread sheet.

>> Lyra2 easily wins at TMTO defense.  Both Lyra2 and Yescrypt use 
>> increasing t_cost to improve TMTO defense, but Lyra2's decision 
>> to continue overwriting memory while reading I think makes it 
>> more TMTO resistant.
> 
> yescrypt also continues overwriting memory while reading, except 
> when run with high p (which Lyra2 doesn't currently support) and 
> the YESCRYPT_PARALLEL_SMIX flag set.  When p > 1 and 
> YESCRYPT_PARALLEL_SMIX is set, yescrypt avoids writes during the 
> portion of its running time where all threads share the entire V. 
> Even in that case, increasing t also increases the number of loop 
> iterations during which writes do happen (although this number is 
> relatively small when p is high).

That's right... I forgot.  My head has been crammed with a lot of
rewviews since Yescrypt!  Certainly with t_cost == 1, Lyra2 wins at
TMTO, because it does 9 read/writes per location.  At the same time,
it gives up a lot of time*memory defense.  I feel Lyra2 does TMTO
well, but at the expense of overall lower security.

In this comparison, I mostly assume t_cost == 0 for Yescrypt, which
makes it's memory access very closer to TwoCats, simplifying the
comparison.

>> TwoCats as submitted uses t_cost poorly.  A revised version 
>> exists that will be submitted if allowed, which does increase 
>> TMTO resistance as t_cost increases, but this feature works 
>> better in Lyra2 and Yescrypt.  However, because TwoCats was 
>> designed as a single pass algorithm, it has the best TMTO 
>> resistance when t_cost turned off, meaning only one memory pass. 
>> This is not a mode Lyra2 allows, but Yescrypt does, and both 
>> Yescrypt and TwoCats achieve higher memory*time defense as a 
>> result.
> 
> yescrypt doesn't support fully skipping its SMix2.  t=0
> corresponds to spending 1/4 of yescrypt's total running time in
> SMix2.

Which sounds great to me.

>> As for time*memory defense, Yescrypt, when run with t_cost == 0, 
>> has the same characteristics as TwoCats, filling memory
>> linearly, never stopping until enough memory has been hashed.
> 
> No.  See above.

Similar, though.  Close enough that I don't feel like updating my
spread sheet.  TwoCats would wind up winning on time*memory defense at
t_cost == 0 by how much, assuming external memory bandwidth limited in
both cases?  You do a read/write pass, very similar to TwoCats, and
that's 2 memory accesses, not counting allocation overhead.  Then you
either read or write but not both for 1/4 memory.  That's at most
1/8th of the memory bandwidth.  I can give TwoCats an 8/7 advantage on
time*memory defense here, but only assuming total memory count.  For
average, the difference is to little to worry about.

>> This maximizes time*memory defense vs any other possible 
>> solution.
> 
> It does, but there are also concerns on the (non-)uniformity of 
> memory accesses, which may enable attacker's dollar cost 
> optimizations via use of different memory types.  You partially 
> deal with this (and with TMTO) via your "cubed distribution". 
> yescrypt deals with it via sliding power of 2 window, but this 
> moves the optimal stopping point for its scrypt SMix-like
> algorithm from N (just first loop) to 4/3*N (1/4th of total time
> spent in SMix's second loop).  We discussed this in here some
> months ago.

Maybe, but I don't known what the different memory types would look
like.  My memory access is weighted heavily towards the most recent
memory written, so an attacker might be able to move early memory to
some slower, cheaper memory.  However, can't figure out any real
memory that I could buy to build such a system.  We both already
pretty much max out the memory bandwidth.  There's not enough time to
stream to disk, even with an SSD.  However, with my broken t_cost set
very high, an SSD could work well for an attacker.

>> Lyra2 does at least two passes over memory, doing twice as many 
>> read/write operations per memory location.  This causes it to
>> use half the memory as TwoCats and Yescrypt when memory
>> bandwidth limited.  All of my ASIC attack plans so far would need
>> only half the memory to attack Lyra2 hashes, lowering it's
>> time*memory defense by 2X.  However, a more sophisticated attack
>> might be able to take advantage of the unfilled memory until it
>> is needed, lowering the time*memory penalty to only a 3/4ths of
>> TwoCats and Yescrypt.
> 
> For TwoCats, average usage is 1/2 of peak.  I guess you derive 3/4 
> of TwoCats' for Lyra2 given that Lyra2's average is also naturally 
> somewhat lower than its peak usage, just not to the same extent.
> 
> For yescrypt t=0 in native mode, average usage is 5/8 of peak.
> 
> Alexander

I went back and looked at Lyra2's memory access again.  With t_cost ==
1 (minimum), it does a total of 9 passes, plus 1 for zeroing if not
running as an authentication server.  It fills to max memory while
doing 2 reads and 2 writes, and then does a pass with 3 reads and 2
writes.

In these benchmarks the zeroing counts, so both loops should run about
the same speed.  So, it fills for 1/2 the time, and then rehashes for
1/2 the time.  It uses 1/2 the peak memory that it would if it just
kept on filling, but the average is 3/4ths as high as it would be if
it kept filling.

The real killer for Lyra2, in addition to poor small-memory GPU
defense, is the lack of any good mode for running with fewer memory
passes.  TwoCats runs with the fewest, and Yescrypt is only slightly
higher at min t_cost.  Overall, I think you made a better design, but
improved TMTO is not why.  I think your improved GPU defense is the
main improvement, followed by better authentication server support,
and then compatibility with Scrypt.  That, plus your code is less
buggy and closer to ready for prime time.  I still like my hybrid
architecture with cache-timing defense, though :-)

Bill
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAEBAgAGBQJUDLpeAAoJEAcQZQdOpZUZn3wQAJ7CVukiMrujA6K7jzWEmlx3
JHLZ8qdmKiphzcA83EC9+Bavbu5bEfafNLXA8p++eGlUfs5ZGlYKfN3/ywK1eW1s
UpLylujysYAQmP6fv+lodfsvLpeGYTuglpp3MgtfWhnx6kwm5AKcbesGbf1NmTuA
3L5tUIbN8+kX446obM7rLr0qkewTFIQxnYAaM1CQNa2kD0AxgWzPlf3XbBuFmUEy
st8sEAI52Zt8vY75XSJDeZL89nUmdKuZCIpYS5pT4SoIPS6jmH5qIAuCamGF+3K+
oshqbWf7M/oNDrr747n09c9kagEoPJJ5/WFnKHsiBIqqa8siAckHyv7lnYgA4Hst
FggtqUmUxxW3hhRbH4+rRgVdyR/TuRGpriv5hIN2V2ZAcVApBSJOUZIGFF3PekYv
kltfHICg64iHdK0MSo9H6TBhizxeEA5IbvsVbC0MbkkHewi38jlq3hTTC+YvMw0L
zwqLazYuvyzkUx3wDR79Ijq3ui4Ms4RJhgqu28t7J9Ytcwp1gJMRgdp9tEwnu718
iGkhP3mrQ5xZdOGQrLZGH/ePlk5l54XOhgSkfWyhyDN1b8p4M+5xlU/hpbkrwvG0
X61dNuLLt1STT/kTBAoyUQYTjj+pHjqD3Y7sHYlzytbcwUYYaN/Ra8Xgny1Hd3KZ
g2pY4RKMjib4Ez5mgA7f
=+Dq6
-----END PGP SIGNATURE-----