Date: Sun, 9 Feb 2014 08:01:43 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Sun, Feb 9, 2014 at 1:18 AM, Samuel Neves <sneves@....uc.pt> wrote:
> On 09-02-2014 04:26, Bill Cox wrote:
>> By the way, if you're familiar with results from skilled hand layout
>> of such circuits in advanced IC processes, I'd love to hear your
>> thoughts on Salsa20/8 latency with a custom hand-optimized layout in
>> the fastest processes.
>
> Sorry, can't help you there; my knowledge is limited to the software
> side of things. That said, your estimate does look reasonable: [1]
> reports 4 cycles (one cycle per double-round) for the highest-area
> implementation, 8 cycles at roughly half the area, or 32 cycles at 1/4th
> of the area. The most efficient (in throughput/gates) is the second choice.
>
> [1] http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4746906

4 cycles per Salsa20/8 sounds about right to me.  Judging from the
abstract, that paper's results were synthesized (not manually
hand-crafted layout) at 180nm.  It probably ran at under 500MHz, if I
had to guess (I have to ask for access to papers through work, and
password-hashing requests annoy them).  Scale that to 3.5-ish GHz, and
it's probably a decent upper bound for today's hand-crafted tech.  By
the way, if you're doing a 28nm ASIC attack, I have no idea why anyone
would be so lazy as to synthesize the core rather than drawing every
single polygon by hand.  Once you've put out over $1M just for the NRE
on a full mask set, why not go the extra mile and pay a layout guru
$200K to build an awesome hand-crafted core?  It's not like these
transistors are all different... it's the same full-adder + XOR
replicated thousands of times.

As for SIMD implementations, I'm afraid I'm a complete noob.  I really
shouldn't be trying to discuss SIMD with you guys... I'm way out of my
league here, but...

Wouldn't we want PMULLW (or rather PMULLD, its 32x32->32 counterpart;
PMULLW operates on 16-bit words) rather than PMULUDQ?  Four packed
32x32->32 low multiplies mimic the C 32-bit multiply, and the low half
of the product is the same for signed and unsigned operands, which is
nice for Java compatibility.  Either way, I see 5 cycle latency in
Haswell here:

http://users.atw.hu/instlatx64/GenuineIntel00306C3_HaswellXeon_InstLatX86.txt

I see 11 cycle latency for some of the multiply instructions here:

http://users.atw.hu/instlatx64/GenuineIntel00406D8_Avoton_InstLatX64.txt

However, PMULLW is listed at 5 cycles latency even for this Atom CPU.
Just for comparison, it looks like my Ivy Bridge quad-core i7 does the
complete hash function in around 6 clocks on average with just gcc -O2
optimization.  I was guesstimating 9 clocks for 4-way SIMD, so about
2.6X faster than a single thread doing it 4 times.  Is that about
right?  I get almost a 2X speed increase with 2 threads, so the
latency per thread doesn't take much of a hit in that case.

As before, 2 threads almost double the throughput, with maybe 7-ish
clocks per hash per thread, and going to more threads helps a bit, but
not enough to justify the increase in latency per hash.

Going back to multiplication by a constant, my slow algorithm finished
overnight and found the same 7-add/subtract answer.  Here's its
computation of 1812433253:

2 = 1 << 1
3 = 2 + 1
24 = 3 << 3
27 = 24 + 3
55296 = 27 << 11
16 = 2 << 3
15 = 16 - 1
55311 = 55296 + 15
1812430848 = 55311 << 15
19 = 16 + 3
2432 = 19 << 7
2405 = 2432 - 27
1812433253 = 1812430848 + 2405

That was a fun little hack :-)

Bill
