Open Source and information security mailing list archives
 
Date: Tue, 11 Mar 2014 05:10:45 -0400
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply latency reduction via table lookups

On Tue, Mar 11, 2014 at 1:05 AM, Solar Designer <solar@...nwall.com> wrote:
> Epiphany has single-cycle reads from the core's local memory (32 KB), at
> clock rates up to 800 MHz for 65nm and up to 1 GHz for 28nm (I'm not
> sure if this is limited by timings or rather by energy efficiency, which
> Epiphany focuses on, and even if by timings the bottleneck might be
> elsewhere).  Reads from nearby cores' local memory (thus, extra 128 KB)
> cost 3 cycles more.

Could be size, too.  RAM is the one thing that doesn't shrink, and
making it go 2X faster usually makes it a lot larger.  2X larger
isn't the right number, but it's close enough to get a feel for it.
This is one reason small RAMs are so much more expensive than large
block RAM, even on the same chip.

>> The fastest small RAM timings I've read
>> have been around 1ns, so there's no help there, but I have not read
>> timing for 28nm RAMs.
>
> POWER7's 2 cycle latency at 3.55 GHz is 0.56ns.

That's a pretty sweet cache RAM.
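
For anyone following the numbers, the latencies quoted above are just
cycle count divided by clock rate.  A minimal sketch in Python (the
function name is made up for illustration, and the 4-cycle figure for
neighbor reads assumes "3 cycles more" is added to the single-cycle
local read):

```python
def latency_ns(cycles, clock_ghz):
    """Convert an access latency in clock cycles to nanoseconds."""
    return cycles / clock_ghz

# POWER7: 2-cycle latency at 3.55 GHz
power7 = latency_ns(2, 3.55)

# Epiphany local memory: single-cycle reads at 800 MHz (65nm) and 1 GHz (28nm)
epiphany_65nm = latency_ns(1, 0.8)
epiphany_28nm = latency_ns(1, 1.0)

# Assumption: a neighbor-core read takes 1 + 3 = 4 cycles in total
epiphany_neighbor_28nm = latency_ns(1 + 3, 1.0)

print(f"POWER7:                 {power7:.2f} ns")
print(f"Epiphany 65nm local:    {epiphany_65nm:.2f} ns")
print(f"Epiphany 28nm local:    {epiphany_28nm:.2f} ns")
print(f"Epiphany 28nm neighbor: {epiphany_neighbor_28nm:.2f} ns")
```

So the POWER7 figure is roughly half the ~1ns small-RAM timings
mentioned earlier, while Epiphany's local reads sit at 1ns or above.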

>> At the very least, multiplication makes it much more challenging to
>> make it go fast.
>
> It certainly appears so.
>
>> I'm not sure if RAMs will help or not... I would
>> have to try it out.  Are there any RAM designers we can ask?
>
> I'm not sure.  We could try asking Andreas about their experience with
> Epiphany's local memory and whether it'd go at much higher clock rates,
> as well as if it can be made larger while maintaining the latency.
>
> Alexander

I'd bet he could tell us.  It's dumb, but the guys who sell IP like
embedded RAMs won't tell you their performance until you get an NDA and
convince them you are a potential buyer.  That just makes it hard to
figure out if you would like to buy their product.  For example:

http://www.businesswire.com/news/home/20131028006116/en/sureCore-Tapes-out-Power-SRAM-IP-Demonstrator-Chip#.Ux7O7_mwJcR

The press release says they have a 28nm SRAM IP that cuts power in
half.  Half from what?  How fast is it?  How big is it?  If you go to
their web site, they don't even mention what IP they sell, just that
they sell IP.  You have to send them an email to find out.  I read a
Ph.D. thesis recently that covered advances in multiplier design, and
they had a table comparing various architectures at 28nm.  All the
units had been removed and the slowest speed had been replaced with
1.0, and the rest were relative.

It's the same way with mask costs.  No one will tell you they require
$1M up front to make the reticles at 28nm.  Is it only $750K nowadays,
or is it still > $1M?  Who knows?  Maybe Andreas does.  He's the guy
who makes super-tiny chips so he can build them in low volume at 28nm,
right?  I like his 4096 core variant a lot more than his 64 core
version.  The difference is most likely that huge up-front mask cost.

However, someone is going to build that device or one like it with a
full mask set and at least 1cm^2 of silicon at 28nm.  It's going to
have a sick number of cores, and it will probably hash Bcrypt like
nothing we've seen before.

Bill
