phc-discussions - Re: [PHC] Initial hashing function. Feedback welcome

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140105040638.GC6394@openwall.com>
Date: Sun, 5 Jan 2014 08:06:38 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Initial hashing function. Feedback welcome

On Tue, Dec 31, 2013 at 08:46:08AM -0500, Bill Cox wrote:
> I haven't added this to keystretch, but should we have another work
> parameter for how hard to push the CPUs?

Maybe.

> I liked your comment about running Salsa20/2, rather than Sala20/8.  Why
> not make the /X a CPU work parameter?

I thought of that, yes.  However, once we deviate from scrypt, I intend
to also deviate from the use of Salsa20 (alone) and towards a
Blowfish-like construction when GPU unfriendliness is desired.

> Without a CPU work parameter, I would lean towards using the fastest
> possible hashing algorithm with no SSE, since it needs to work on mobile
> devices.

We need to choose what we optimize for, and what we merely support
(suboptimally).  Supporting a wide variety of platforms optimally will
likely require flags to choose between different primitives.

That said, Salsa20 and ChaCha20 are pretty efficient on both desktop and
mobile CPUs.  Also, many mobile ARM CPUs support NEON.

Another curious option is Salsa6420 by @floodyberry.  I think it hasn't
been reviewed by anyone, but in terms of efficiency it appears to
perform better than Salsa20 and ChaCha20 when put into scrypt:

https://github.com/floodyberry/scrypt-jane

No multi-threaded benchmarks there, unfortunately.  (I wouldn't be
surprised if what's faster with 1 thread/core is slower than another
primitive at 2 threads/core.)

> I just used the default build on Arch Linux for scrypt. The
> files crypto_scrypt-nosse.c and crypto_scrypt-sse.c are identical,

Huh?!  No, they're different in scrypt-1.1.6.tgz.

> so I
> guess the only difference is compiler flags.  I assume SSE is on by
> default, but it doesn't look like there's been any hand optimization for
> SSE, but I'm not sure.

There are SSE2 intrinsics in crypto_scrypt-sse.c in scrypt-1.1.6.tgz.
When AVX is enabled with the compiler (e.g., with -march=native), these
are compiled to AVX instructions instead.

> > In my testing (with optimized scrypt code using AVX or better) reducing
> > Salsa20 rounds count from 8 to 2 (4x reduction in round count) results
> > in less than 2x speedup, which suggests that replacing Salsa20/N with
> > something way simpler is not going to make all that much difference
> > either.  I guess you may be seeing much more of a difference in part
> > because you're only benchmarking a single thread.
> 
> True, but for low CPU power situations like a mobile phone, that's an
> important case.  I would be interested in seeing both single and multiple
> thread performance of Salsa20/N, and with/without SSE or better.

So I posted some benchmarks.

> One reason for keeping random writes in L1 cache, other than speed, is to
> reduce information leakage about the key, just in case someone is able to
> detect cache miss timing.

Do you think random writes are more leaky than random reads?  Why?
I think these are about the same in terms of cache timing attacks.
So if you do random reads anyway, you won't defeat cache timing attacks
by avoiding random writes.

> > Another drawback is the very low probability that the last few V
> > elements written will ever be accessed.
> 
> That worries me, too.  I like your described approach of having a sliding
> window.  It's probably just better.  Would an attacker be able to free
> memory that slides out of the window and reuse it?

The attacker would be able to swap out that memory, but would need to
swap much of it back in (or access slower memory directly) upon reaching
the next power of 2.  Also, with SMix's second loop kept almost intact,
all of the memory locations, including in the previously
maybe-swapped-out portion, will be accessed by that loop with 63%
probability.

> > > - 2048 rounds of PBKDF2_SHA256 are used at the start to generate an
> > > intermediate derived key.
> >
> > I dislike this.  It's very wasteful in use cases where our total running
> > time is very limited.  This may be all running time we can afford (if at
> > all), not leaving any time for the memory-hard portion.
> 
> I may be having some basic disconnect here.  On my system, running
> 2,000,000 SHA-256 rounds using the code from scrypt takes 2.4s.  The worst
> performing desktop based Javascript benchmark I've read is 50,000
> rounds/second.  Even a low-end smartphone browser should be able to do 2048
> rounds in a small fraction of a second.
> 
> Why do we care about the first few milliseconds?

Because there exist real-world use cases where the total duration is
limited to a few milliseconds.  One of those is mass user
authentication (think e.g. a social network with millions of free
accounts), where the upgrade path must be reasonable and affordable
to the business.  The required throughput for first deployment is in
thousands of authentication requests per second per machine.  Another
use case is operating system defaults, where the OS is meant to run on a
wide variety of machines and architectures (e.g. for OpenBSD and NetBSD
this will range from VAX to modern machines), yet may accept having only
the same default settings for all (this means being limited to e.g. 1ms
on modern machines, in order to stay at e.g. 1s on VAX).  Also, there
are cases where any non-negligible increase in DoS susceptibility is not
acceptable, yet no measures such as rate limiting are being introduced
along with the move to slower password hashing.

NOELKDF is currently worse than bcrypt (at least as it relates to
attacks with GPUs) at such low running times, though - for reasons
unrelated to the initial PBKDF2.

> Am I missing something here?

Yes, other use cases.

> I suspect I've dropped a couple orders of magnitude somewhere...

No, you did not.

You're right in pointing out that even 2048+ iterations of SHA-256 is
not of much help, though.  So if we do such pre-hashing before starting
to fill memory for real, maybe we should do something along the lines of
bcrypt or bcrypt_pbkdf, so that we make use of the L1 cache:

http://www.tedunangst.com/flak/post/bcrypt-pbkdf
http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libutil/bcrypt_pbkdf.c

A drawback is added complexity.  Another drawback is deviating from
usage of a NIST-approved primitive, unless we do the SHA-256 as well.

> > > - The derived key does not depend on the number of threads

> > ... as long as the number of threads is not greater than a pre-specified
> > maximum.  I can claim this too. ;-)  And scrypt can claim it too,
> > although with scrypt there's a reduction in area-time from increasing p,
> > so it better not be done unnecessarily.
> 
> Good point!  For some reason I though scrypt's result changed with p.

It does change with p, but so does the result of NOELKDF with change of
your hard-coded max supported thread count.

> I was shocked to see that the API we have to implement for the password
> hashing competition passes the password as a "const void *" eliminating the
> possibility of clearing the password.

You may include that in your own API.

The PHC-mandated API allows for wrapping of the hashing scheme e.g. into
crypt(3) without requiring the wrapper to create a copy of the password.
Without the const, a crypt(3) wrapper would have to make a copy of the
password, because that API doesn't allow for an implementation to change
the supplied strings.

> Clearing the password shouldn't even be an option.  It should be mandatory.

Unfortunately, some existing APIs don't agree, and besides cleaning
anything from memory (and registers?) is tricky or/and unreliable.
If you do a memset(), a compiler can simply optimize it out.  This may
be better than not even trying, but not something you should put too
much trust into.

A way around this is to re-invoke the KDF with some fast to compute
inputs after computation of the actual output, hopefully overwriting
just the same memory locations and registers that it normally uses.
This also lets us perform a quick runtime self-test.  (I implemented
that in recent versions of crypt_blowfish.)  Even then, the kernel may
save the thread's registers further below on the stack when invoking a
signal handler, and it may in turn save registers even further down on
the stack when calling other functions.  A second invocation of the KDF
might not result in going as deep on the stack.  So some risk remains.

> > Also, for password hashing bcrypt is a better baseline, and you don't
> > reach it - as it relates to attacks with GPUs - by 2048 iterations of
> > PBKDF2-HMAC-SHA-256.
> 
> True, but we're following this with a memory-hard KDF.  I don't want to
> spend as much time as bcrypt takes.  I just want to start with good
> security and improve from there.

OK, but note that bcrypt provides some GPU unfriendliness even at the
same running time as your PBKDF2 (lower than bcrypt's normal running time).
There's some orders of magnitude difference between PBKDF2-HMAC-SHA-256
and bcrypt in terms of current GPU unfriendliness at same running time.

> I would be interested in how fast Salsa20/1 runs.

It is unclear if going below 2 rounds for Salsa20 makes sense and is
even easy to define unambiguously, due to the way Salsa20 is defined
(even and odd rounds are not the same).

Alexander