phc-discussions - TwoCats multiplication chain

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140322191655.GA25041@openwall.com>
Date: Sat, 22 Mar 2014 23:16:55 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: TwoCats multiplication chain

Bill,

In current TwoCats, you have:

                // Compute the multiplication chain
                for(uint32_t k = 0; k < multiplies; k++) {
                    v = (uint32_t)v * (uint64_t)oddState[k];
                    v ^= randVal;
                    randVal += v >> 32;
                }

I think this has twice lower ASIC attack latency than what you
probably expected.  Here are some test questions: can the next
multiplication start being computed before the previous one completes?
Can this reduce total latency for two multiplications, and by how much?

In a 32x32 multiplier, the highest latency is for bit 31 of the result.
I look at these diagrams when considering what happens there:

http://www10.edacafe.com/book/ASIC/Book/Book/Book/CH02/CH02.6.php?interstitial_displayed=Yes#pgfId=167904

In your code, only one of the partial products for bit 31 depends on the
previous iteration's bit 31 of v.  Thus, almost all of the partial
products can start being computed before bit 31 of the previous
iteration's v is ready.

I think "randVal += v >> 32;" limits the speedup from such overlapped
computation to a factor of 2, since each bit from 32 to 63 depends on
bit 31 of the multiplicand, and this addition with carry makes all of
them potentially propagate to bit 31 of randVal, and thus to v, but this
final propagation only happens with a delay of one loop iteration.

I think changing the multiplication chain to:

                for(uint32_t k = 0; k < multiplies; k++) {
                    v = (uint32_t)v * (uint64_t)oddState[k];
                    randVal += v >> 32;
                    v ^= randVal;
                }

defeats this attack.  Unfortunately, this has slightly lower efficiency
for the defender (more time is spent on latencies of non-multiplication
operations, which don't cost as much latency in ASIC).

On a related note, now that you've moved to 32x32->64, which is
lossless, I think you don't need oddState[] anymore (you can use state[]
directly, without the "| 1").

Alexander