phc-discussions - Re: EARWORM speed and improved defense

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAOLP8p73XDTpO54qzcsA-WJF7MQiwEKLH4tGUyhOsMO79sjZvA@mail.gmail.com>
Date: Mon, 21 Apr 2014 13:50:12 -0400
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: EARWORM speed and improved defense

On Sun, Apr 20, 2014 at 9:59 AM, Bill Cox <waywardgeek@...il.com> wrote:

> Here's the raw data when calling EARWORM's PHS function:
> PHC> time ./phs-earworm $((1 << 20)) 10000
> Creating arena
> Allocating 268435456 bytes
> Hashing password
> count:1048576
>
> 7b 46 5d db ff b2 ef 85
> 25 af 9a 5a 0c 65 76 74
> 78 62 ff 52 92 e0 ca 1e
> d5 6f 59 15 fc 40 10 d2      32 (octets)
>
>
> real    1m25.882s
> user    1m25.907s
> sys     0m0.017s
>
> It spent at least 2 seconds creating the "arena" which is a shared ROM.
>  Subtracting that out, and I get:
>
> Total memory hashed: 1 TiB
> Time: 83 seconds
> Bandwidth: 12.3 GiB/s
>
> This is the highest memory bandwidth I've measured, though I've factored
> out memory allocation, and if I did that with Yescript and Lyra2, they
> might be close to this.
>
> Bill
>

I reran EARWORM with some tweaks.  I set the "width" to 512, and the
"depth" to 2, leaving each work unit as 1MiB, but increasing the internal
scratchpad size to 8KiB from 512 bits.  I also had to comment out some
calls to prf, since they were slowing down the algorithm considerably.  The
modified code does:

14.48 GiB/s

Wow!  That's the new record.  This tweaked version has better ROM bandwidth
hardening, because an attacker splitting the ROM into N pieces will have to
transmit 4KiB of scratchpad data between nodes every time it updates a
scratchpad state, and the new node will only be able to hash 4KiB of data
into the scratchpad state before it has to be transmitted again.  However,
it has worse runtime hardening, because all 256 AESINC hashes can be
computed in parallel.  This is likely also why the bandwidth increased.

I put back the constants the way they were originally, made the scratchpad
array EARWORM_CHUNK_AREA+4 in size, and then I chained the AESINC
computations in 4 lanes, with a simple hack, and now I get:

12.44 GiB/s

This version has both large enough state size and compute time hardening.
 My modified code is:

  /* Main loop */
  for(d = 0; d < EARWORM_WORKUNIT_DEPTH; d+=2) {
    __m128i *s = scratchpad;
    for(l = 0; l < EARWORM_CHUNK_LENGTH; l++) {
      for(w = 0; w < EARWORM_CHUNK_WIDTH; w++) {
        s[4] = _mm_aesenc_si128(*s, arena[arena_index_a++]);
        s++;
      }
    }
    memmove(scratchpad, scratchpad + EARWORM_CHUNK_AREA, 4*sizeof(__m128i));
    arena_index_a = to_index(scratchpad[0], m_cost);
    _mm_prefetch(&arena[arena_index_a], _MM_HINT_T0);

    s = scratchpad;
    for(l = 0; l < EARWORM_CHUNK_LENGTH; l++) {
      for(w = 0; w < EARWORM_CHUNK_WIDTH; w++) {
        s[4] = _mm_aesenc_si128(*s, arena[arena_index_b++]);
        s++;
      }
    }
    memmove(scratchpad, scratchpad + EARWORM_CHUNK_AREA, 4*sizeof(__m128i));
    arena_index_b = to_index(scratchpad[0], m_cost);
    _mm_prefetch(&arena[arena_index_b], _MM_HINT_T0);
  }

I think this is on the right track.  In particular, EARWORM in its current
form can be attacked both in terms of cost per guessing node (reducing the
ROM per node by a factor of N) and how many guesses can be computed in
parallel on each node, attacking both the time and cost dimensions of
EARWORM's defense.  With only 512 state bits per ongoing password hash, and
a 1TiB ROM split 256 ways, it is reasonable for each node to have 100
million or more password guesses stored in local RAM, giving us an
opportunity for 100 EARWORM hashing cores to work in parallel per node.
 This modification above seems to harden EARWORM's time dimension against
attacks that split the ROM N ways.  With this change, each node will only
be able to do about one EARWORM hash at a time.

There's still a simple attack to reduce an attacker's ROM cost per node,
but it's not EARWORM specific, so I'll post it on another thread.

Bill

Content of type "text/html" skipped