Message-ID: <CAOLP8p7xCy68uQ7Tz5Oj=o5r1xFZBG4R6adizxqspHG8zBKyDQ@mail.gmail.com>
Date: Sun, 9 Feb 2014 08:01:43 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)
On Sun, Feb 9, 2014 at 1:18 AM, Samuel Neves <sneves@....uc.pt> wrote:
> On 09-02-2014 04:26, Bill Cox wrote:
>> By the way, if you're familiar with results from skilled hand layout
>> of such circuits in advanced IC processes, I'd love to hear your
>> thoughts on Salsa20/8 latency with a custom hand-optimized layout in
>> the fastest processes.
>
> Sorry, can't help you there; my knowledge is limited to the software
> side of things. That said, your estimate does look reasonable: [1]
> reports 4 cycles (one cycle per double-round) for the highest-area
> implementation, 8 cycles at roughly half the area, or 32 cycles at 1/4th
> of the area. The most efficient (in throughput/gates) is the second choice.
>
> [1] http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4746906
4 cycles per Salsa20/8 sounds about right to me. From the abstract,
that paper synthesized the design (no manual hand-crafted layout) at
180nm. It probably ran at < 500MHz, if I had to guess (I have to
request access to papers through work, and password-hashing requests
annoy them). Scale that to 3.5-ish GHz, and it's probably a decent
upper bound for today's hand-crafted tech. By the way, if you're
doing a 28nm ASIC attack, I have no idea why anyone would be so lazy
as to synthesize the core rather than drawing every single polygon by
hand. Once you've put out over $1M just for the NRE on a full mask
set, why not go the extra mile and pay a layout guru $200K to build an
awesome hand-crafted core? It's not like these transistors are all
different... it's the same full-adder + XOR replicated thousands of
times.
As for SIMD implementations, I'm afraid I'm a complete noob. I really
shouldn't be trying to discuss SIMD with you guys... I'm way out of my
league here, but...
Wouldn't we want PMULLW rather than PMULUDQ? A packed 32x32->32 low
multiply mimics the C 32-bit multiply, and gives the same result for
both signed and unsigned operands, which is nice for Java
compatibility. Either way, I see 5-cycle latency on Haswell here:
http://users.atw.hu/instlatx64/GenuineIntel00306C3_HaswellXeon_InstLatX86.txt
I see 11-cycle latency for some instructions here:
http://users.atw.hu/instlatx64/GenuineIntel00406D8_Avoton_InstLatX64.txt
However, PMULLW is listed at a 5-cycle latency even on that Atom CPU.
Just for comparison, it looks like my Ivy Bridge quad-core i7 does the
complete hash function in around 6 clocks on average with just gcc -O2
optimization. I was guesstimating 9 clocks for 4-way SIMD, so about
2.6X faster than a single thread doing it 4 times. Is that about
right? As before, 2 threads almost double the throughput, at maybe
7-ish clocks per hash per thread, so the latency per thread doesn't
take much of a hit in that case. Going to more threads helps a bit,
but not enough to justify the increase in latency per hash.
Going back to constant multiplication: my slow algorithm finished
overnight and found the same 7-add/sub answer. Here's its computation
of 1812433253:
2 = 1 << 1
3 = 2 + 1
24 = 3 << 3
27 = 24 + 3
55296 = 27 << 11
16 = 2 << 3
15 = 16 - 1
55311 = 55296 + 15
1812430848 = 55311 << 15
19 = 16 + 3
2432 = 19 << 7
2405 = 2432 - 27
1812433253 = 1812430848 + 2405
That was a fun little hack :-)
Bill