phc-discussions - Re: [PHC] multiply latency reduction via table lookups

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAOLP8p5Qw+koXXA7iq9a=ngwBM1WHezaumOWYsTKu90i8sTr-Q@mail.gmail.com>
Date: Mon, 10 Mar 2014 20:57:58 -0400
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply latency reduction via table lookups

This thread has me thinking that we can compute-time harden our PHSs
(for Password Hashing Schemes) with L1 cache access times, as an
alternative to multiplication, if we want to.  There's a reason we
don't use small RAMs to accelerate multiplication: they're too slow!

Doing 32-byte random lookups in L1 cache and hashing it with one other
32 byte value, and writing it back to yet another random lookup, would
run really fast on Haswell, and I do not believe it would be easy for
an ASIC attacker to do better.

It would have to be done carefully, since we also want to write to as
much memory as possible to maximize time*memory cost for an attacker,
so somehow we'd need to figure out a strategy for hammering L1 cache
without cache miss penalties while at the same time filling external
DRAM at the bandwidth it can support.  The multiplications seem to
happen in parallel on the scalar unit, while memory hashing can happen
in the SIMD unit.  Is it possible to do something similar with cache
vs external DRAM hashing?

It's almost a mute point for my PHC entry at this point.  There's not
enough time left for another major rewrite.  However, I've been
considering moving the repeat loop I have to use more time from around
the lowest level block hashing to the outer memory hashing.  There are
arguments for both ways.  By having my repeat N times loop around the
low level block hash, I hammer L1 cache, while dropping DRAM
bandwidth.  Putting around the whole memory hash function does the
reverse.

I think attackers have a natural advantage on GPUs and ASICs in terms
of DRAM memory bandwidth (maybe 10X-ish the bandwidth per dollar), but
less of an advantage in L1 cache data rates.  This makes me more
tempted to focus on the L1 thrashing.

On the other hand, an ASIC or a highly parallel CPU with small local
memories, and fast clock speeds and low latencies, would tear up
hashes that tried to be L1 cache hardened.  By spending a long time
thrashing L1 cache rather than filling external memory, we play right
into an attacker's hands.

I don't know how to do both at the same time... Alexander?  This seems
like your kind of thing.

Bill