phc-discussions - Please consider multiplication hardening

Message-ID: <CAOLP8p7iQwWDOMcRfp+LtOTQhXZqjtKHcU6+gd9A8+ppdAF2gw@mail.gmail.com>
Date: Sat, 1 Mar 2014 15:51:26 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Please consider multiplication hardening

Solar Designer suggested the idea, and I just happen to know enough
about ASICs to figure out that it was a good one.  I'm sure Solar
Designer feels the same: please feel free to use multiplication
compute-time hardening in your own PHC entry.

After noodling on how best to build semi-configurable chips to attack
the PHC winner(s), I no longer believe high memory bandwidth is enough
to slow down government sponsored attackers sufficiently.  A
sequential chain of multiplication and one or more of add, XOR, and
rotate (though not just an add), seems to be a winning combination for
providing strong compute-time hardening, because ASICs and
hand-designed custom chips cannot compute serial multiplications
substantially faster than our computers' CPUs.
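
As a rough illustration of such a chain, here is a minimal C sketch. The
function name mul_chain, the choice of multiply followed by XOR, and
forcing the multiplier odd with "| 1" are assumptions made for this
example, not a specification from any PHC entry.

```c
#include <stdint.h>

/* Hypothetical helper: a serial multiply/XOR chain.
 * Every step consumes the result of the previous step, so the
 * multiplications cannot be overlapped with one another. */
static uint32_t mul_chain(uint32_t v, const uint32_t *data, unsigned rounds)
{
    for (unsigned i = 0; i < rounds; i++) {
        /* Assumption: force the multiplier odd so the multiply is
         * invertible mod 2^32 and v does not collapse toward zero. */
        v = v * (data[i] | 1u);
        /* Mix in a non-add operation, per the "not just an add" advice. */
        v ^= data[i];
    }
    return v;
}
```

The loop-carried dependency on v is the point: latency, not throughput,
bounds how fast the chain can be evaluated.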

I would feel honored rather than offended if you guys "borrowed" this
idea to strengthen your own algorithms.

Some tips for implementation:

Intel and AMD CPUs have excellent SSE/AVX2 vector instructions for
hashing memory fast.  There is a decent 4x32->2x64 SSE multiplication
instruction, but the 4x32->4x32 instruction is too slow in Haswell,
Intel's latest architecture.  The fastest multiplier is in the regular
integer units.  My Core i7-3770 does a multiplication in 3 cycles, or
about 0.88ns.  I know of no ASIC implementation of a 32x32->32
multiplier that is faster (though I strongly suspect there is one -
see below).

To keep the serial multiplication chain from stealing time away from
memory hashing, consider running the multiplication chain on the
scalar integer unit, while hashing memory using the SSE or AVX2 unit.
They can both run in parallel, but they contend for the same L1 data
bus.

To keep them running fast, have the multiplication chain run out of
CPU registers, and let the SSE/AVX2 unit have full access to the L1
data bus.  Pointers to memory used by the SSE/AVX2 code live in
regular CPU registers, so if you need a value, consider casting it to
a uint32_t * or a uint64_t * and load some of the same data
read/written by the SSE/AVX2 unit.
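
A minimal sketch of that arrangement, assuming an x86 target with SSE2
and 16-byte blocks (rather than the 32-byte AVX2 blocks I actually
use).  The function name hash_sse2, the rotation count, and the mixing
formula are made up for illustration.  Here the buffer is declared as
uint32_t, so the scalar side reads it directly and the cast goes the
other way, to __m128i *, for the unaligned SSE2 loads and stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; x86 only */

/* Hypothetical sketch: SSE2 hashes 16-byte blocks while a one-step
 * multiply chain runs in scalar registers in the same loop. */
static uint32_t hash_sse2(uint32_t *mem, size_t nblocks, uint32_t v)
{
    for (size_t b = 1; b < nblocks; b++) {
        uint32_t *prev = mem + (b - 1) * 4;
        uint32_t *cur  = mem + b * 4;

        /* Scalar side: a single 32-bit load from data the SIMD side
         * also reads, then multiply and XOR entirely in registers. */
        uint32_t w = prev[0];
        v = v * (w | 1u);   /* assumption: odd multiplier */
        v ^= w;

        /* SIMD side: read, XOR, add, rotate left by 7, write. */
        __m128i p = _mm_loadu_si128((const __m128i *)prev);
        __m128i c = _mm_loadu_si128((const __m128i *)cur);
        __m128i x = _mm_add_epi32(_mm_xor_si128(c, p), p);
        x = _mm_or_si128(_mm_slli_epi32(x, 7), _mm_srli_epi32(x, 25));
        _mm_storeu_si128((__m128i *)cur, x);
    }
    return v;
}
```

The scalar side touches memory once per block; everything else it does
stays in registers, leaving the L1 data bus to the vector loads and
stores.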

You should have a configurable multiplication chain length, since
memory speed and multiplication speed are not well correlated.  For
example, when running out of L1 cache on Haswell, even one 3-cycle
multiplication plus a 1-cycle XOR takes longer than a 32-byte read,
XOR, addition, rotation, and write.  When I run out of external DDR3
memory, I need 3 to 5 multiplications for that same SSE/AVX2 memory
hash operation.  You'll want to be able to have 0 or more multiplies
to compute-harden the hashing to the user's machine.  I allow
multiplication chain lengths to be adapted from 0 to 8 and run
that out of registers in the regular scalar unit in a loop that hashes
and writes 32 bytes of memory in the SSE/AVX2 unit.  I was able to
fool the compiler into instantiating 9 different versions of the
hashing loop, one for each multiplication chain length, but there are
probably other ways to do it.
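
One possible way to get the nine variants, sketched in portable C: put
the loop in an inline function that takes the chain length as a
parameter, then call it from a switch with a literal constant in each
case so the compiler can specialize and unroll each copy.  All names
here (hash_blocks, hash_memory, BLOCK_WORDS) are hypothetical, the
scalar inner loop merely stands in for the SSE/AVX2 memory hash, and
clamping lengths above 8 is an assumption.  Whether the compiler really
emits nine specialized loops depends on the optimizer and its settings.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_WORDS 8  /* 8 x 32 bits = 32 bytes per block */

static inline uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32 - r));  /* valid for 0 < r < 32 */
}

/* Shared loop body.  When muls is a literal at the call site and this
 * function is inlined, the inner chain can be fully unrolled. */
static inline uint32_t hash_blocks(uint32_t *mem, size_t nblocks,
                                   uint32_t v, unsigned muls)
{
    for (size_t b = 1; b < nblocks; b++) {
        uint32_t *prev = mem + (b - 1) * BLOCK_WORDS;
        uint32_t *cur  = mem + b * BLOCK_WORDS;

        /* Serial multiply/XOR chain, 0 to 8 steps. */
        for (unsigned i = 0; i < muls; i++) {
            v = v * (prev[i] | 1u);  /* assumption: odd multiplier */
            v ^= prev[i];
        }
        /* Stand-in for the vector hash: read, XOR, add, rotate, write. */
        for (unsigned w = 0; w < BLOCK_WORDS; w++)
            cur[w] = rotl32((cur[w] ^ prev[w]) + v, 7);
    }
    return v;
}

/* Dispatch once, outside the hot loop, to a constant chain length. */
uint32_t hash_memory(uint32_t *mem, size_t nblocks, uint32_t v, unsigned muls)
{
#define CASE(n) case n: return hash_blocks(mem, nblocks, v, n)
    switch (muls) {
    CASE(0); CASE(1); CASE(2); CASE(3); CASE(4);
    CASE(5); CASE(6); CASE(7);
    default: return hash_blocks(mem, nblocks, v, 8);  /* clamp to 8 */
    }
#undef CASE
}
```

The chain length would be picked per machine by timing, since memory
speed and multiplier speed are not well correlated.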

If you do it right, the multiplication chains will not hurt your
memory hashing performance at all on Intel and AMD CPUs.  It will
simply make your algorithm stronger.

I am aware of credible rumors that, at least in the late 1990s, a US
government password cracking center was built that used cryogenic
custom ASIC multipliers, and a modern version of those will run faster
than our CPU multipliers (maybe 3X faster? - I don't know).  However,
those same ASICs will also run all the other instructions faster, and
multiplication remains our best bet for compute time hardening.

Thanks,
Bill