phc-discussions - Re: [PHC] Reworked KDF available on github for feedback: NOELKDF

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sat, 4 Jan 2014 04:49:49 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Reworked KDF available on github for feedback: NOELKDF

Thanks again for the detailed reply.  All this data is very helpful!  One
data point I wouldn't mind seeing: what is the runtime of noelkdf with 1
thread?

On Sat, Jan 4, 2014 at 12:27 AM, Solar Designer <solar@...nwall.com> wrote:

> On Fri, Jan 03, 2014 at 03:12:40PM -0500, Bill Cox wrote:
> > I gave it a name this time.  It's based on the experience so far from
> > keystretch, but is simpler, and even faster.  It doesn't use a "key" to
> > hash pages, but instead treats the previously generated "page" as the key
> > for generating the next one by hashing it with a random previous page.
>  On
> > my Corei7 server, it runs about 15% slower than memmove (which is faster
> > than memcpy because 1/2 the memory is allocated).  2GB of hashing takes
> > 0.35 seconds.  Both are multi-threaded 2-way, which is the fastest case
> on
> > both machines.
>
> This is very nice speed.  Somehow I get a worse speed when testing on
> FX-8120 (stock clocks) with 2x DDR3-1600 memory sticks, running Linux:
>
> $ ./run_noelkdf
> 324C977EDEF4972925261D6B996C904E1EF297BC0EFF4286887D187CA16326EE
>
> real    0m1.148s
> user    0m0.912s
> sys     0m1.180s
>

I'm running a home-built system I made for my son's birthday.  It's fast,
but not the fastest  He runs Minecraft servers on them, which makes
reliable benchmarking difficult.  Here's the parts I bought from Newegg to
build it:

CORSAIR Vengeance 8GB (2 x 4GB) 240-Pin DDR3 SDRAM DDR3 1600 (PC3 2800) Desktop
Memory Model CMZ8GX3M2A1600C9B

Intel Core i7-3770 Ivy Bridge 3.4GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core
Desktop Processor Intel HD Graphics 4000 BX80637I73770

GIGABYTE GA-B75M-D3H LGA 1155 Intel B75 HDMI SATA 6Gb/s USB 3.0 Micro ATX
Intel Motherboard

Here's the runtime with git
commit 00cef314195850dc7829dd684051a3646ca2cf93.  I modified some docs most
likely since your benchmark - nothing that changed performance.  Here's
three runs I just did:

noelkdf> ./run_noelkdf
324C977EDEF4972925261D6B996C904E1EF297BC0EFF4286887D187CA16326EE

real    0m0.311s
user    0m0.560s
sys     0m0.040s
noelkdf> ./run_noelkdf
324C977EDEF4972925261D6B996C904E1EF297BC0EFF4286887D187CA16326EE

real    0m0.315s
user    0m0.480s
sys     0m0.130s
noelkdf> ./run_noelkdf
324C977EDEF4972925261D6B996C904E1EF297BC0EFF4286887D187CA16326EE

real    0m0.305s
user    0m0.380s
sys     0m0.210s

I like to benchmark this against a simple memmove which reads 2GB and moves
it 8 bytes.  I'm only reading 2GB once, and writing it once, which is
pretty close to memove.  I assume the previous "page" stays in L1 cache, so
it's slowing noelkdf down, but not much.  Here's memcopy runtimes just now
on the same machine:

noelkdf> time ./memorycpy
0

real    0m0.186s
user    0m0.130s
sys     0m0.230s
noelkdf> !!
time ./memorycpy
0

real    0m0.186s
user    0m0.070s
sys     0m0.290s
noelkdf> !!
time ./memorycpy
0

real    0m0.185s
user    0m0.060s
sys     0m0.300s

I was wrong about only being 15% slower than memcopy.  For some reason,
when I measured it before the post, memcopy ran slower.  My guess is that
my son's friends were tying up some CPUs and Linux decided to run both
threads of memcopy on the same CPU.  Noelkdf is running about 65% slower
than a call to memmove for 2GB.


> I kept NUM_THREADS at 2.  Does your 0.35 seconds possibly not include
> the memory allocation overhead?  How do you measure it?  Would it be
> e.g. 0.912/2 = ~0.46 seconds (of real time presumably for the hashing
> portion only) from the test above?  If so, that is not a very reliable
> way to measure it, I'd say.
>

It's the whole run, just like you did, but I run it 3 times and take the
lowest.  Often the first run is quite a bit slower as memory gets moved
around.


> 4 threads:
>
> $ ./run_noelkdf
> 4D6C717AB933FBE65BE4A61F8CECC4BA2FCE4151752CD372F8B23923B8C98C73
>
> real    0m0.801s
> user    0m1.024s
> sys     0m1.616s
>
> 8 threads:
>
> $ ./run_noelkdf
> 83D88DFC80995C63A9807583A3D56DBD88CE79757E382A56149C9E1BE07BD716
>
> real    0m0.666s
> user    0m1.620s
> sys     0m2.416s
>

Thanks for the extra data.  I think I'm running into memory bandwidth
bottlenecks and I think my CPU is doing a lot of instructions in parallel
already, which accounts for the 2-thread optimum on my machines.  I'll have
to make the number of threads an option like I did for keystretch.


> To put it in perspective, here are some benchmark results for our
> current escrypt tree on the same machine.  Most of these are for a
> XOP-enabled build, except where stated otherwise.
>
> First, in scrypt backwards compatibility mode, we get the scrypt test
> vectors for up to 1 GB (most time is spent on the 1 GB, indeed):
>
> $ time ./tests
> scrypt("", "", 16, 1, 1) = 77 d6 57 62 38 65 7b 20 3b 19 ca 42 c1 8a 04 97
> f1 6b 48 44 e3 07 4a e8 df df fa 3f ed e2 14 42 fc d0 06 9d ed 09 48 f8 32
> 6a 75 3a 0f c8 1f 17 e8 d3 e0 fb 2e 0d 36 28 cf 35 e2 0c 38 d1 89 06
> scrypt("password", "NaCl", 1024, 8, 16) = fd ba be 1c 9d 34 72 00 78 56 e7
> 19 0d 01 e9 fe 7c 6a d7 cb c8 23 78 30 e7 73 76 63 4b 37 31 62 2e af 30 d9
> 2e 22 a3 88 6f f1 09 27 9d 98 30 da c7 27 af b9 4a 83 ee 6d 83 60 cb df a2
> cc 06 40
> scrypt("pleaseletmein", "SodiumChloride", 16384, 8, 1) = 70 23 bd cb 3a fd
> 73 48 46 1c 06 cd 81 fd 38 eb fd a8 fb ba 90 4f 8e 3e a9 b5 43 f6 54 5d a1
> f2 d5 43 29 55 61 3f 0f cf 62 d4 97 05 24 2a 9a f9 e6 1e 85 dc 0d 65 1e 40
> df cf 01 7b 45 57 58 87
> scrypt("pleaseletmein", "SodiumChloride", 1048576, 8, 1) = 21 01 cb 9b 6a
> 51 1a ae ad db be 09 cf 70 f8 81 ec 56 8d 57 4a 2f fd 4d ab e5 ee 98 20 ad
> aa 47 8e 56 fd 8f 4b a5 d0 9f fa 1c 6d 92 7c 40 f4 c3 37 30 40 49 e8 a9 52
> fb cb f4 5c 6f a7 7a 41 a4
>
> real    0m2.451s
> user    0m1.872s
> sys     0m0.572s
>
> This is 1.9 seconds user, 2.5 seconds total running time.  1 thread
> (even though one of the test vectors has p > 1, I did not bother
> enabling threads for this test, which focuses on the 1 GB anyway).
>

That's a nice improvement over regular scrypt.  On my machine, with a
slightly modified (to print the time) version of the Arch Linux scrypt
package:

scrypt-1.1.6> ./scrypt enc foo
Please enter passphrase:
Please confirm passphrase:
N:524288 r:8 p:1
1.860979

On my system, I do 2GB in .32 seconds, while scrypt does 0.5GM in 1.86
seconds.  I'm running 23X faster than regular scrypt, though with 2
threads.  Noelkdf takes .45 seconds with 1 thread, which is still 16X
faster.  If you're doing 1GB in 2.5 seconds on 1 thread vs what I'm just
guessing might be 1.6 seconds for 2GB of noelkdf,   That's only a factor of
3X, so I'm inferring you've sped up scrypt by as much as 5X.

The scrypt paper gives 3.8 seconds for 1 GB on Core 2 Duo (running only
> one thread too).  It does not mention whether the memory allocation
> overhead is included or not.  We appear to have a ~2x improvement since
> then, due to faster CPU and memory, and code optimizations (including
> use of XOP on this CPU).
>
> Not that in a way scrypt's 1 GB corresponds to NOELKDF's 2 GB, because
> scrypt runs for 2*N time.  In terms of cost to attacker, NOELKDF's 2 GB
> is _presumably_ better than scrypt's 1 GB.
>

Why do you say scrypt's 1GB is the same as noelkdf's 2GB?  For 1G, both
read 1GB from memory, and write 1GB to memory.  Noelkdf just does it in 1
loop instead of 2.

Now let's try escrypt in one of its native modes, with random lookups
> added to SMix's first loop and writes added to SMix's second loop.  The
> running time increases slightly:
>
> real    0m2.581s
> user    0m2.080s
> sys     0m0.492s
>
> I think this is a very acceptable cost for the added TMTO resistance,
> when that is desired.
>

I remain a fan of your TMTO resistance.  I have not matched it.  I tried
with the spinlocks and threads working in sync on small blocks, but as you
helped me figure out, the threads were stomping on each other's pages, and
I lost any benefit.  Currently, I just run multiple threads the way scrypt
does, with each having it's own separate memory, allowing an attacker to
run them in series in less memory or in parallel with more memory.

With the random lookups in first loop, this is a more direct comparison
> to NOELKDF (which also does those).
>
> Now let's expand to 2 GB to make sure we're no weaker than NOELKDF.  (In
> fact, we should be 2x+ stronger than NOELKDF in that mode, because the
> iteration count is 2*N, just like in scrypt, and additionally the second
> loop's distribution of random accesses is uniform, unlike NOELKDF's, and
> there are random writes.)
>
> real    0m5.155s
> user    0m4.120s
> sys     0m1.024s
>
> OK, 5 seconds (or 4 seconds without the memory allocation overhead) is
> quite a lot, but that was for one thread, whereas NOELKDF's performance
> is for 2 threads (and more threads don't help NOELKDF, as Bill says).
>

It looks like more threads help on your machine, just not on mine.  I think
that's just a CPU/memory thing.  My CPU seems to be doing a lot of parallel
execution already, which is making more than 2 threads lose.  Again, I
don't see any additional memory bandwidth in scrypt vs noelkdf per GB.

Here's a nice comparison of the two processors:

http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2

If I'm reading it right, your AMD FX 8 processor has a limit of 8
instructions in parallel, while the Bulldozer Core i7 can do 16.  By
itself, that could explain the runtime differences we're seeing for
noelkdf.  It also would explain why I get little benefit from more than 2
threads.

Let's try multiple threads.  Note that this disables the random writes
> in SMix second loop (so is more similar to NOELKDF in that respect).
>
> 2 threads:
>
> real    0m2.841s
> user    0m4.488s
> sys     0m1.160s
>
> 4 threads:
>
> real    0m1.700s
> user    0m4.968s
> sys     0m1.400s
>
> 8 threads:
>
> real    0m1.112s
> user    0m5.608s
> sys     0m2.144s
>
> With this, we got to ~3x the running time of the 0.35s claimed for
> NOELKDF, but this is for 2x+ area-time cost to attacker and including
> memory allocation overhead time.  It is at 4x the thread count, but if
> NOELKDF doesn't benefit from more than 2 threads, then this is a fair
> per CPU chip speed comparison (relevant e.g. for use on one's own
> computer for generating a filesystem encryption key).
>

It's probably better to compare to your 8-thread run of noelkdf on your
machine, which ran in .666 seconds.  With 2 threads, noelkdf seems to be
running about 2.5X faster, but at 8 threads, the difference drops to about
1.7X.  My guess is that the times are converging because memory allocation
is a constant time for both, and we're running into a memory bottleneck.
 I'm pretty sure we read/write the same amount per GB.

We're also at less than 2x higher running time compared to 0.666s, which
> was best for NOELKDF on this machine (8 threads, too).  So escrypt wins
> for security per time. ;-)  But I guess that optimizations for NOELKDF
> are possible (including some without breaking compatibility with the way
> it's currently defined).

[ Note that AMD's FX-8120 is for performance comparisons similar to Intel's
> quad-core CPUs with HT.  AMD's "modules", which this CPU has 4 of, provide
> similar amount of resources per hardware thread as Intel's "cores" do. ]
>

Except for instruction parallelism.  I tuned that inner loop until it
screamed on my machine while doing as much hashing as possible without
slowing down, so it's no surprise that it doesn't scream on other machines.
 I'd love to see if I can get it running in .3-ish seconds on your  machine
by simplifying the hash function even more.  However, SIMD probably will
get it there.


> Salsa20 rounds reduced from 8 to 4:
>
> real    0m0.926s
> user    0m3.988s
> sys     0m2.272s
>
> Salsa20 rounds reduced to 2:
>
> real    0m0.854s
> user    0m3.136s
> sys     0m2.536s
>
> This is some significant speedup (less than 2x, though) in terms of user
> time, but less of a speedup in terms of total running time (with the
> overhead included).
>

I think this seems to imply approaching a memory bottleneck.  At rounds ==
1/4, I'm betting you run in 0.666 seconds :-)  Given all the work Smix
does, this is impressive optimization!


> Let's try to compute the hash 10 times per program invocation, to move
> the memory allocation overhead out of the loop (is only done once here):
>
> real    0m6.159s
> user    0m45.363s
> sys     0m2.584s
>
> Now it's down to ~0.6s per hash, which is better than 0.35s claimed for
> NOELKDF, given escrypt's 2x iteration count (and thus cost to attacker).
>

Yep!  I thought we were on track to around that.  I'll do the same with 2
threads:

./run_noelkdf
B9B9D5EFC1001498ADDB5893FD9A359A72CA1845335D8C6825FC42D301006AD6

real    0m3.404s
user    0m6.700s
sys     0m0.090s

It looks like I'm only getting .34 seconds/iteration, which is worse than I
was getting earlier.  I think my server has mood swings, probably related
to temperature.  By the time it runs 2 cores that long, it's already
scaling back the clock.  This could also partially explain why my machine
hates more than 2 threads.  However, for that short burst, those two
threads scream.

If you like, I'll run escrypt on my machine for you to see if you see
similar speedups.


> Let's revert to 1 GB allocation, to have same iteration count as NOELKDF
> (but up to 2x lower area-time):
>

I see no 2x difference.  Still, more data is good.


> real    0m3.041s
> user    0m22.309s
> sys     0m1.316s
>
> 0.3s per hash now.
>
> Salsa20 round count reverted to the default of 8:
>
> real    0m3.985s
> user    0m30.038s
> sys     0m1.040s
>
> 0.4s per hash, which is only slightly worse than NOELKDF's 0.35s at same
> iteration count.
>

I wouldn't worry about the numbers I get on my machine.  I've simply got a
different CPU and memory setup.  Your setup seems optimized for multiple
threads, while mine likes one core going fast while the others are idle.
 The numbers you got for noelkdf on your machine are fair, though I'd like
to try out some tricks to improve it on your machine's architecture.


> An SSE2 only build (no XOP):
>
> real    0m4.976s
> user    0m38.098s
> sys     0m0.944s
>
> 0.5s per hash now.
>
> -nosse build, without threads (not implemented in that version of the
> code yet) and without many possible optimizations (ditto):
>
> real    0m45.829s
> user    0m45.239s
> sys     0m0.568s
>
> Now it's 4.5s per hash.  Ouch.  It is comparable to NOELKDF in that the
> code doesn't use SIMD (unfortunately, it also doesn't use threads yet,
> unlike NOELKDF).
>

Just set NUM_THREADS to 1.  I'll make that an option on the command line
tomorrow.


> Summary: NOELKDF may be a lot faster than escrypt on SIMD-less builds,
> but might be almost same speed when benchmarked against SIMD-enabled
> builds.  The question is then how SIMD'able NOELKDF is, and how fast
> those builds will run.
>
> Alexander
>

I think you're assuming a 2X that isn't there, but that's a separate issue.
 Noelkdf is highly SIMD-able.  Here's the inner loop:

    for(i = 1; i < PAGE_LENGTH - 1; i++) {
        *toPage++ = *prevPage + ((*fromPage * *(prevPage + 1)) ^ *(fromPage
- 1));
        prevPage++;
        fromPage++;
    }

Each iteration can be run in parallel, and PAGE_LENGTH is 16K.  Memory
bandwidth seems to be the limiting speed factor.  I'm doing about 12GB/s,
but memmove seems to be able to hit 22GB/s.  From my memmove tests, my
machine seems to max out around 22 GB/second of memory bandwidth.  Can we
do better with SSE instructions and temporal cache instructions?

Again, thanks for the review and all the benchmark data!  I wish I could
get you to SIMD optimize my code.  There are still very significant issues
I don't know the answers to, like is it OK to have that 64-bit multiply in
there?  I had it based on 2 32-bit multiplies and a couple shifts, but
64-bit ran faster everywhere I tested it.  I tried converting the code to
all 32-bit, but it doubled my run-times.  If you don't mind working on two
hashing projects in the same competition, I'd really appreciate having
another author.  Two chances to win is better :-)

Content of type "text/html" skipped