Message-ID: <20140311050549.GC12707@openwall.com>
Date: Tue, 11 Mar 2014 09:05:49 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply latency reduction via table lookups

On Mon, Mar 10, 2014 at 09:59:14AM -0400, Bill Cox wrote:
> The quarter squared thing is cool.  Roughly, it looks like the initial
> add/sub can be done in parallel, two 16-bit lookups could be done in
> parallel in a clock cycle most likely, though that's bigger than an
> Intel L1 cache, isn't it?   Also, they don't get 1 cycle latency, I
> think it's 2 or 3 to L1, isn't it?
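
For reference, here is a minimal sketch of the quarter-square trick as I
understand it: a*b = floor((a+b)^2/4) - floor((a-b)^2/4), with the squares
taken from a lookup table.  The function names are made up for
illustration, Python is used only for brevity, and the table size estimate
assumes one shared table with entries rounded up to whole bytes.

```python
def build_quarter_square_table(bits):
    """Return a table of floor(n*n/4) for n in 0 .. 2*(2**bits - 1)."""
    max_sum = 2 * ((1 << bits) - 1)  # largest possible a + b
    return [(n * n) >> 2 for n in range(max_sum + 1)]


def qs_multiply(a, b, table):
    """Multiply two unsigned operands using two table lookups.

    a + b and |a - b| are independent of each other, so hardware could
    compute them in parallel, do both lookups, then subtract once.
    They always have the same parity, so the floor errors cancel.
    """
    return table[a + b] - table[abs(a - b)]


def table_bytes(bits):
    """Estimate the table size in bytes (assumption: whole-byte entries)."""
    entries = 2 * ((1 << bits) - 1) + 1
    max_value = ((entries - 1) ** 2) >> 2
    entry_bytes = (max_value.bit_length() + 7) // 8
    return entries * entry_bytes


if __name__ == "__main__":
    table = build_quarter_square_table(8)
    # Exhaustive check for 8-bit operands.
    assert all(qs_multiply(a, b, table) == a * b
               for a in range(256) for b in range(256))
    print("8-bit table:", table_bytes(8), "bytes")
    print("16-bit table:", table_bytes(16), "bytes")
```

Under these assumptions, the table for 16-bit operands is about 512 KB,
which is consistent with it being bigger than a typical L1 cache.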

On most CPUs with high clock rates, L1 cache reads take 3+ cycles.
Latency figures for many CPUs (including non-x86) are available at:

http://7-cpu.com

Here's the only CPU on that site with single-cycle latency from L1:

http://7-cpu.com/cpu/K5.html

Its clock rate was low, indeed, but sane for its time (albeit 2x to 3x
lower than contemporary Intel CPUs', which may be what allowed for the
lower latency here).

These have 2 cycles from L1:

http://7-cpu.com/cpu/Power7.html - 3.55 GHz, 32 KB
http://7-cpu.com/cpu/US3.html - 1 GHz, 64 KB
http://7-cpu.com/cpu/Itanium2.html - 1.3 GHz, 16 KB
http://7-cpu.com/cpu/P4-130.html - 2.2 GHz, 8 KB
http://7-cpu.com/cpu/P4-180.html
http://7-cpu.com/cpu/US2.html
http://7-cpu.com/cpu/P-MMX.html
http://7-cpu.com/cpu/Mips4K.html

Epiphany has single-cycle reads from the core's local memory (32 KB), at
clock rates up to 800 MHz for 65nm and up to 1 GHz for 28nm.  I'm not
sure whether the clock rate is limited by timings or rather by energy
efficiency, which Epiphany focuses on; even if it is limited by timings,
the bottleneck might be elsewhere.  Reads from nearby cores' local memory
(thus, an extra 128 KB) cost 3 cycles more.

> The fastest small RAM timings I've read
> have been around 1ns, so there's no help there, but I have not read
> timing for 28nm RAMs.

POWER7's 2 cycle latency at 3.55 GHz is 0.56ns.
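
The same conversion can be applied to the other figures quoted in this
message.  This is just a sketch of the arithmetic (latency in ns = cycles
divided by clock rate in GHz); the function name and the data layout are
made up for illustration, and the numbers are only the ones given above.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles / clock_ghz


# (name, latency in cycles, clock rate in GHz), as quoted in this message.
# Only entries with a stated clock rate are included.
FIGURES = [
    ("Power7 L1", 2, 3.55),
    ("US3 L1", 2, 1.0),
    ("Itanium2 L1", 2, 1.3),
    ("P4-130 L1", 2, 2.2),
    ("Epiphany local, 65nm", 1, 0.8),
    ("Epiphany local, 28nm", 1, 1.0),
    # Nearby core's memory costs 3 cycles more than the 1-cycle local read.
    ("Epiphany nearby core, 65nm", 1 + 3, 0.8),
    ("Epiphany nearby core, 28nm", 1 + 3, 1.0),
]

if __name__ == "__main__":
    for name, cycles, ghz in FIGURES:
        print(f"{name}: {cycles_to_ns(cycles, ghz):.2f} ns")
```

By this measure, only the Power7 and P4-130 entries come in under 1ns, and
Epiphany's local memory at 1 GHz is right at 1ns.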

> At the very least, multiplication makes it much more challenging to
> make it go fast.

It certainly appears so.

> I'm not sure if RAMs will help or not... I would
> have to try it out.  Are there any RAM designers we can ask?

I'm not sure.  We could try asking Andreas about their experience with
Epiphany's local memory: whether it could run at much higher clock rates,
and whether it can be made larger while maintaining the latency.

Alexander
