phc-discussions - Re: [PHC] multiply latency reduction via table lookups

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140311050549.GC12707@openwall.com>
Date: Tue, 11 Mar 2014 09:05:49 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply latency reduction via table lookups

On Mon, Mar 10, 2014 at 09:59:14AM -0400, Bill Cox wrote:
> The quarter squared thing is cool.  Roughly, it looks like the initial
> add/sub can be done in parallel, two 16-bit lookups could be done in
> parallel in a clock cycle most likely, though that's bigger than an
> Intel L1 cache, isn't it?   Also, they don't get 1 cycle latency, I
> think it's 2 or 3 to L1, isn't it?

On most CPUs with high clock rates, L1 cache reads take 3+ cycles.  Such
info across many CPUs (including non-x86) is available at:

http://7-cpu.com

Here's the only CPU on that site with single-cycle latency from L1:

http://7-cpu.com/cpu/K5.html

Low clock rate, indeed, but at the time it was sane (albeit 2x to 3x
lower than contemporary Intel CPUs', which may be what allowed for the
lower latency here).

These have 2 cycles from L1:

http://7-cpu.com/cpu/Power7.html - 3.55 GHz, 32 KB
http://7-cpu.com/cpu/US3.html - 1 GHz, 64 KB
http://7-cpu.com/cpu/Itanium2.html - 1.3 GHz, 16 KB
http://7-cpu.com/cpu/P4-130.html - 2.2 GHz, 8 KB
http://7-cpu.com/cpu/P4-180.html
http://7-cpu.com/cpu/US2.html
http://7-cpu.com/cpu/P-MMX.html
http://7-cpu.com/cpu/Mips4K.html

Epiphany has single-cycle reads from the core's local memory (32 KB), at
clock rates up to 800 MHz for 65nm and up to 1 GHz for 28nm (I'm not
sure if this is limited by timings or rather by energy efficiency, which
Epiphany focuses on, and even if by timings the bottleneck might be
elsewhere).  Reads from nearby cores' local memory (thus, extra 128 KB)
cost 3 cycles more.

> The fastest small RAM timings I've read
> have been around 1ns, so there's no help there, but I have not read
> timing for 28nm RAMs.

POWER7's 2 cycle latency at 3.55 GHz is 0.56ns.

> At the very least, multiplication makes it much more challenging to
> make it go fast.

It certainly appears so.

> I'm not sure if RAMs will help or not... I would
> have to try it out.  Are there any RAM designers we can ask?

I'm not sure.  We could try asking Andreas about their experience with
Epiphany's local memory and whether it'd go at much higher clock rates,
as well as if it can be made larger while maintaining the latency.

Alexander