Message-ID: <CAOLP8p7xCy68uQ7Tz5Oj=o5r1xFZBG4R6adizxqspHG8zBKyDQ@mail.gmail.com>
Date: Sun, 9 Feb 2014 08:01:43 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Sun, Feb 9, 2014 at 1:18 AM, Samuel Neves <sneves@....uc.pt> wrote:
> On 09-02-2014 04:26, Bill Cox wrote:
>> By the way, if you're familiar with results from skilled hand layout
>> of such circuits in advanced IC processes, I'd love to hear your
>> thoughts on Salsa20/8 latency with a custom hand-optimized layout in
>> the fastest processes.
>
> Sorry, can't help you there; my knowledge is limited to the software
> side of things. That said, your estimate does look reasonable: [1]
> reports 4 cycles (one cycle per double-round) for the highest-area
> implementation, 8 cycles at roughly half the area, or 32 cycles at 1/4th
> of the area. The most efficient (in throughput/gates) is the second choice.
>
> [1] http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4746906

4 cycles per Salsa20/8 sounds about right to me. From the abstract, that design was synthesized (not a manually hand-crafted layout) at 180nm. It probably ran at under 500MHz, if I had to guess (I have to ask for access to papers through work, and password-hashing requests annoy them). Scale it to 3.5-ish GHz, and it's probably a decent upper bound for today's hand-crafted tech.

By the way, if you're doing a 28nm ASIC attack, I have no idea why anyone would be so lazy as to synthesize the core rather than drawing every single polygon by hand. Once you've put out over $1M just for NRE for a full mask set, why not go the extra mile and pay a layout guru $200K to build an awesome hand-crafted core? It's not as if these transistors are all different... it's the same full-adder + XOR replicated thousands of times.

As for SIMD implementations, I'm afraid I'm a complete noob. I really shouldn't be trying to discuss SIMD with you guys...
I'm way out of my league here, but... wouldn't we want PMULLW rather than PMULUDQ? Four packed 32x32->32-low multiplies mimic the C 32-bit multiply, and give the same result for both signed and unsigned multiplication, which is nice for Java compatibility. Either way, I see 5-cycle latency on Haswell here:

http://users.atw.hu/instlatx64/GenuineIntel00306C3_HaswellXeon_InstLatX86.txt

I see 11-cycle latency for some instructions here:

http://users.atw.hu/instlatx64/GenuineIntel00406D8_Avoton_InstLatX64.txt

However, PMULLW is listed at 5 cycles latency even for this Atom CPU.

Just for comparison, it looks like my Ivy Bridge quad-core i7 does the complete hash function in around 6 clocks on average with just gcc -O2 optimization. I was guesstimating 9 clocks for 4-way SIMD, so about 2.6X faster than a single thread doing it 4 times. Is this about right? I get almost a 2X speed increase for 2 threads, so the latency per thread doesn't take much of a hit in that case. As before, 2 threads almost double the throughput, with maybe 7-ish clocks per hash per thread; going to more threads helps a bit, but not enough to justify the increase in latency per hash.

Going back to constant multiplication, my slow algorithm finished overnight and got the same 7-add/sub answer. Here's its computation of 1812433253:

2 = 1 << 1
3 = 2 + 1
24 = 3 << 3
27 = 24 + 3
55296 = 27 << 11
16 = 2 << 3
15 = 16 - 1
55311 = 55296 + 15
1812430848 = 55311 << 15
19 = 16 + 3
2432 = 19 << 7
2405 = 2432 - 27
1812433253 = 1812430848 + 2405

That was a fun little hack :-)

Bill