Date: Fri, 3 Apr 2015 15:25:11 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Compute time hardness

On Fri, Apr 03, 2015 at 03:13:06PM +0300, Solar Designer wrote:
> Here's another relevant detail I recalled:
> 
> Pentium 4 (some or all of them? not sure) had double-pumped ALU, where
> it could perform ADDs at double the clock rate (so up to 7.6 GHz, at
> stock clocks).
> 
> http://www.anandtech.com/show/1611/7
> http://forums.anandtech.com/showthread.php?t=603812
> https://news.ycombinator.com/item?id=8255157
> http://en.wikipedia.org/wiki/NetBurst_(microarchitecture)#Rapid_Execution_Engine
> 
> Apparently, this could only execute two dependent ADDs in a cycle if
> they are 16-bit each.  To me, this indicates that an ASIC would probably
> be able to do similar for 32-bit if it wanted to.

... and even 64-bit, since latency of a carry lookahead adder grows less
than linearly.  (I'd be interested to see actual latency vs. width data,
but I couldn't easily find any.)
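
To illustrate the "less than linearly" point, here is a minimal sketch that counts idealized logic levels for a ripple-carry adder vs. a parallel-prefix (Kogge-Stone style) adder.  This is not measured latency data: the function names are made up for this example, and the model assumes one unit of delay per logic level while ignoring fan-out and wire delay, which matter in real designs.

```python
def ripple_carry_depth(width):
    """Idealized depth: the carry passes through one stage per bit."""
    return width


def prefix_adder_depth(width):
    """Idealized depth of a parallel-prefix adder.

    One level to form per-bit generate/propagate signals,
    ceil(log2(width)) prefix-combining levels, and one final
    XOR level to produce the sum bits.
    """
    prefix_levels = (width - 1).bit_length()  # ceil(log2(width)) for width >= 1
    return 1 + prefix_levels + 1


if __name__ == "__main__":
    for width in (16, 32, 64):
        print(width, ripple_carry_depth(width), prefix_adder_depth(width))
```

Under these assumptions, going from 16-bit to 64-bit adds only two logic levels to the prefix adder, while the ripple-carry depth quadruples.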

> So I think this confirms 8x-ish difference in latency between fastest
> ADD and MUL.

Pentium 4 had two dependent (16-bit) ADDs per cycle at sub-4 GHz clock
rate at 180 nm.

Current CPUs need at least 3 cycles per MUL at sub-4 GHz clock rate at
22 nm.  (AMD APUs that have 2 cycles per MUL run at ~2 GHz.)

This suggests a 6x difference in latency (3 cycles per MUL vs. half a
cycle per ADD) if it were the same process and same bit width.  Given
that 180 nm vs. 22 nm is probably more of a difference than 16-bit ADD
vs. 64-bit ADD, I think 8x is more realistic.
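
The arithmetic behind the 6x figure, as a small sketch.  It assumes both chips run at the same clock rate, so cycle counts can be compared directly; the helper names and the round 4 GHz and 2 GHz values are placeholders for "sub-4 GHz" and "~2 GHz", not exact clocks.

```python
from fractions import Fraction


def mul_to_add_latency_ratio(dependent_adds_per_cycle, mul_cycles):
    """Ratio of MUL latency to ADD latency, both in cycles of the same clock."""
    add_cycles = Fraction(1, dependent_adds_per_cycle)
    return Fraction(mul_cycles) / add_cycles


def latency_ns(cycles, clock_ghz):
    """Latency in nanoseconds for a given cycle count and clock rate in GHz."""
    return Fraction(cycles) / Fraction(clock_ghz)


if __name__ == "__main__":
    # Two dependent ADDs per cycle vs. 3 cycles per MUL.
    print(mul_to_add_latency_ratio(2, 3))  # 6
    # 3-cycle MUL at a nominal 4 GHz vs. 2-cycle MUL at a nominal 2 GHz.
    print(latency_ns(3, 4), latency_ns(2, 2))  # 3/4 1
```

With these placeholder clocks, the 2-cycle MUL at ~2 GHz is no faster in absolute time than the 3-cycle MUL at sub-4 GHz, which is why the 3-cycle figure is the one used for the ratio.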

Also, there's some per-instruction latency cost in a CPU, unlike in an
ASIC (where there's no distinction between e.g. two ADDs that are part
of a Blake2 round and two ADDs that are part of a MUL).

Alexander
