Date: Mon, 10 Mar 2014 09:59:14 -0400
From: Bill Cox <>
Subject: Re: [PHC] multiply latency reduction via table lookups

The quarter-squared trick is cool.  Roughly, it looks like the initial
add/sub can be done in parallel, and the two 16-bit lookups could most
likely be done in parallel in a clock cycle, though a table that size
is bigger than an Intel L1 cache, isn't it?  Also, L1 doesn't have
1-cycle latency; I think it's 2 or 3 cycles, isn't it?  So 16x16->32
is hard to make super fast.  In my own limited small-block memory
design experience, my small RAMs run on the order of the time a
multiplier would take: I've got decode logic, long capacitive word
lines to drive, and small wimpy transistors pulling down large
capacitive bit lines.  However, there are guys who take fast RAM
design to an art form and magically eliminate a lot of that delay.

I've also seen some cool circuits that make adders faster by clamping
all the internal nodes at mid-supply until the clock edge, achieving
much lower data propagation delays per bit.

So, can multiplication be done faster?  Maybe cryogenic cooling is the
simplest way... I don't know.  The fastest small-RAM timings I've read
have been around 1ns, so there's no help there, but I have not seen
timings for 28nm RAMs.

At the very least, multiplication makes it much more challenging to
make it go fast.  I'm not sure if RAMs will help or not... I would
have to try it out.  Are there any RAM designers we can ask?

