phc-discussions - multiply latency reduction via table lookups

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140310080229.GA9001@openwall.com>
Date: Mon, 10 Mar 2014 12:02:29 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: multiply latency reduction via table lookups

Bill -

It is known that e.g. a 32x32->64 multiply can be computed via four
16x16->32 multiplies.  Moreover, these four can run in parallel (they
are then followed by some ADDs, but this is sub-cycle latency in ASIC).
e.g. here's code I wrote for JtR (for portability to ancient platforms)
back in 1990s:

void mul32by32(int64 *dst, unsigned int m1, unsigned int m2)
{
	dst->lo = (m1 & 0xFFFF) * (m2 & 0xFFFF);
	dst->hi = 0;

	add32to64m(dst, (m1 >> 16) * (m2 & 0xFFFF));
	add32to64m(dst, (m2 >> 16) * (m1 & 0xFFFF));

	dst->hi += (m1 >> 16) * (m2 >> 16);
}

(The same algorithm can be found in plenty of places.  I just happened
to reinvent it at the time.)

Now, a 16x16->32 lookup table is still pretty large.  A naive
implementation of it would take 16 GiB.  A smarter implementation would
probably halve that, due to multiplication being commutative.  (Can we
do better yet?)  Currently, the lookup latency for a table this large
would probably be way in excess of 3 cycles (assuming a decent clock
rate) that we've been assuming an optimal multiplication circuit needs.

However, what if we apply the same approach recursively?  With just one
more application of it, we need multiplies as small as 8x8->16, which
means a 128 KiB lookup table (one with many read ports, or multiple such
tables), or perhaps twice smaller.  Everything else is just ADDs, and
fewer levels of them than a table-less multiplication circuit would
contain.

Thus, a 32x32->64 multiply appears to be equivalent to a bunch of
parallel lookups from a 128 KiB table, followed by a few levels of ADDs
(not too many?)

Can this be used to perform 32x32->64 multiplication in fewer than 3
cycles at a decent clock rate (same as with a table-less multiplier)?

Do e.g. CPUs use table lookups like this for multiplication already?
Is this possibly how they (and some ASICs) achieve the 3 cycles already,
meaning that we shouldn't worry about this allowing for < 3 cycles?
If so, the die area consumed by fast multipliers may be more than we had
estimated.

Alexander