Date: Sat, 15 Feb 2014 10:07:46 -0500
From: Bill Cox <>
Subject: Re: OT: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

Yesterday I tested blending scalar multiplications and fast SSE memory
hashing all on the same thread.  This morning I'm testing SSE memory
hashing on 1 or more threads, and multiplication hashing on a single
thread.  It gives awesome results.

With 2 memory threads doing basic SSE memory hashing, and one
multiplication thread running the current NoelKDF hash function over
just 256 bits of state, I am able to hash 2GB in 0.31 seconds while
doing 240,711,928 multiplications.  This is 68% multiplication-time
hardness, without even factoring out the memory allocation overhead,
and about 13GB/s of bandwidth.

This is just unbelievable performance.  It will likely drop some when
I add all the bells and whistles.  Obviously, hard-coding 240,711,928
is not a great idea.  I'm thinking that during initial password
hashing, when a user enters his password for the first time, we could
run for a specified period of time and simply measure how much memory
we hashed and how many multiplications we did.  The number of parallel
memory hashing threads may also need to be tunable; the right number
on my machine is 2.
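The timed calibration idea could be sketched like this (a rough
sketch: the function name calibrate_multiplies, the initial state, and
the use of clock() are my assumptions, not anything from NoelKDF):

    #include <stdint.h>
    #include <time.h>

    /* Hypothetical calibration helper: run the multiplication hash for
     * roughly target_ms milliseconds of CPU time and report how many
     * multiplications fit.  The returned count would be stored as a
     * hashing parameter alongside the salt. */
    static uint64_t calibrate_multiplies(uint32_t target_ms) {
        uint32_t hash[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        uint32_t value = 1;
        uint64_t count = 0;
        clock_t limit = clock() +
            (clock_t)((uint64_t)target_ms * CLOCKS_PER_SEC / 1000);
        while (clock() < limit) {
            for (uint32_t j = 0; j < 8; j++) {
                value = (value*(hash[j] | 1)) ^ (hash[7-j] >> 1);
                hash[j] = value;
            }
            count += 8;
        }
        return count;
    }

The same trick would measure memory hashed per unit time for the
memory threads.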

This is so much better than the current NoelKDF that I think I need
to rewrite it.

Now that I'm back to the drawing board, I need two hash functions: a
multiplication hash function, and a memory hash function.  Here's the
multiplication hash loop I used.  'hash' is an array of 8 uint32_t's.
Is it good enough?

    uint32_t value = 1;  // running value; 'hash' is 8 uint32_t's
    for(i = 0; i < 240711928/8; i++) {
        uint32_t j;
        for(j = 0; j < 8; j++) {
            value = (value*(hash[j] | 1)) ^ (hash[7-j] >> 1);
            hash[j] = value;
        }
    }

I guess I should at least test this for collisions at the end.  I
could also dump the state every 64 outer loops or so and see if it
passes the dieharder tests.
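The state-dumping idea might look like this (a sketch; dump_states and
the initial state are mine).  dieharder can read raw bytes from stdin
with -g 200, so piping this harness's output into `dieharder -a -g 200`
would run the full battery:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical harness: run the multiplication hash for
     * outer_loops iterations, writing the 256-bit state to 'out' as
     * raw bytes every 64 outer loops. */
    static void dump_states(FILE *out, uint32_t outer_loops) {
        uint32_t hash[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        uint32_t value = 1;
        for (uint32_t i = 0; i < outer_loops; i++) {
            for (uint32_t j = 0; j < 8; j++) {
                value = (value*(hash[j] | 1)) ^ (hash[7-j] >> 1);
                hash[j] = value;
            }
            if ((i & 63) == 63)
                fwrite(hash, sizeof(hash), 1, out);
        }
    }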

For memory hashing, I used:

    __m128i *prev1 = mem + BLOCK_LENGTH;
    __m128i *prev2 = mem;
    __m128i *to = mem + 2*BLOCK_LENGTH;
    // _mm_srl/sll_epi32 take a single shift count from the low 64
    // bits of the count operand, so only lane 0 may be set
    __m128i shiftRightVal = _mm_set_epi32(0, 0, 0, 25);
    __m128i shiftLeftVal = _mm_set_epi32(0, 0, 0, 7);
    for(i = 2; i < NUM_BLOCKS; i++) {
        uint32_t randAddr = *(uint32_t *)prev1;
        __m128i *from = mem + BLOCK_LENGTH*(randAddr % i);
        uint32_t j;
        for(j = 0; j < BLOCK_LENGTH; j++) {
            __m128i value = _mm_add_epi32(*prev1++, *prev2++);
            value = _mm_xor_si128(value, *from++);
            // Rotate each lane left by 7 (equivalently right by 25)
            value = _mm_or_si128(_mm_srl_epi32(value, shiftRightVal),
                                 _mm_sll_epi32(value, shiftLeftVal));
            *to++ = value;
        }
    }

In English, this is 4 parallel lanes (I should probably make it 8, or
maybe adjustable) of 32-bit hashes, where each of the 4 lanes is
computed as:

    mem[i + 4] = ROTATE((*prev1 + *prev2) ^ *from, 7)
    prev1 += 4; prev2 += 4; from += 4

Here, ROTATE rotates left by 7, wrapping the top 7 bits around to the
low bits (the intrinsics compute (value >> 25) | (value << 7)).  For
some reason, I can issue a lot of _mm instructions without slowing
down writes to memory.  The loop seems to be limited by the random
reads through *from.
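Per 32-bit lane, the SSE inner loop can be modeled in scalar C like
this (a sketch; the function name and signature are mine):

    #include <stdint.h>

    /* Scalar model of one lane of the SSE memory-hash inner loop:
     * each output word is (prev1 + prev2) ^ from, rotated by 7 with
     * (v >> 25) | (v << 7), matching the intrinsics' shift counts. */
    static void hash_block_scalar(uint32_t *to, const uint32_t *prev1,
                                  const uint32_t *prev2,
                                  const uint32_t *from, uint32_t words) {
        for (uint32_t k = 0; k < words; k++) {
            uint32_t v = (prev1[k] + prev2[k]) ^ from[k];
            to[k] = (v >> 25) | (v << 7);
        }
    }

A scalar reference like this is also handy for checking the SSE
version's output word by word.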

I'll hash 100KB or so, then do a SHA256 (what do you recommend?) over
the 256 bits of state in the multHash, and also over the 256 bits of
parallel 'value' lanes once I increase them to 8.  Does this sound
reasonable?
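The finalization step could be sketched as follows (a sketch under my
own assumptions: the name finalize_key and the state layout are
hypothetical, and SHA-256 comes from OpenSSL's libcrypto, linked with
-lcrypto):

    #include <stdint.h>
    #include <string.h>
    #include <openssl/sha.h>

    /* Hypothetical finalization: concatenate the 256-bit multiply-hash
     * state with the 256 bits of parallel lane values (8 lanes), then
     * SHA-256 the 64 bytes into the derived key. */
    static void finalize_key(const uint32_t multHash[8],
                             const uint32_t laneVals[8],
                             uint8_t out[SHA256_DIGEST_LENGTH]) {
        uint8_t buf[64];
        memcpy(buf, multHash, 32);
        memcpy(buf + 32, laneVals, 32);
        SHA256(buf, sizeof(buf), out);
    }

Running the whole combined state through one cryptographic hash at the
end means neither intermediate function has to be collision-resistant
on its own.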

Should I e-mail the PHC submission list and ask them to hold off
evaluating NoelKDF, since I'm doing a major overhaul?  Can I blame

