Open Source and information security mailing list archives
Message-ID: <CAOLP8p5yekA-VJP87LW_ADTzcT5LUYaR5OBxHNsKzpO8F-mKeg@mail.gmail.com>
Date: Thu, 13 Feb 2014 11:23:29 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Thu, Feb 13, 2014 at 10:57 AM, Samuel Neves <sneves@....uc.pt> wrote:
> On 13-02-2014 15:40, Solar Designer wrote:
>> On Wed, Feb 12, 2014 at 06:20:38PM -0500, Bill Cox wrote:
>>> I wrote an SSE 4.1 version of the inner hash loop in NoelKDF, with a
>>> quick hack where I compute 4 separate interleaved hashes, so it gets a
>>> different answer, and has 4 times the parallelism. I used the fast
>>> packed 32x32->32 multiply available on my Ivy Bridge processor
>>> (_mm_mullo_epi32), which we know will run a lot slower on Haswell.
>>> This instruction isn't even available until SSE 4.1, so that's a
>>> narrow band of processors that support it well.
>>
>> Right, although Samuel Neves posted nice emulation of it via SSE2 in:
>>
>> Date: Sun, 09 Feb 2014 05:47:29 +0000
>> From: Samuel Neves <sneves@....uc.pt>
>> To: discussions@...sword-hashing.net
>> Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)
>
> By the way, as an addendum to this, I should point out that the
> performance of running 4 64x64->64 multiplications in parallel in
> Haswell is not significantly worse than VPMULLD.
>
> The best measured latency for emulated VPMULLD I got (with the SSE2
> version) is 9 Haswell cycles, whereas emulating a 64x64->64-bit
> multiplication (using 3 VPMULUDQ and a few shifts/additions) can be done
> in 11 cycles of latency. Native VPMULLD itself is 10 cycles.

Nice! I measured the inner loop on my Ivy Bridge processor without any
memory bandwidth limitations, and it does 1 iteration every 4 clocks.
The multiply takes 3 and the add takes 1, so this seems as good as
possible. Everything else is done in parallel, at least when running
only 1 thread.
The only problem with 9 cycles for VPMULLD or 11 for VPMULUDQ is that an
ASIC attacker will do this same computation in 4-ish cycles. He sees no
latency increase when adding more parallelism.

Multi-threading seems to offer a way to increase memory bandwidth without
increasing the multiply-loop latency as much as SIMD does. However, the
combination increases memory bandwidth more than I can achieve otherwise:
on my machine, from 10GB/sec to 13GB/sec. Memmove only achieves 17GB/sec,
so this is pretty good, though 2 threads achieve 22GB/sec (maybe someone
should upgrade memmove to do this by default). On the other hand, the
SIMD version takes only one CPU, while 2 threads take 2. The SIMD version
may be better for systems with lots of users.

I'm still torn as to whether or not to offer the SIMD option. Higher
memory bandwidth is crucial, but I don't like losing compute-time
hardening as a result. I wish the NSA could weigh in on this and let us
know which would frustrate their password crackers more.

Either way, I think we're down to the last 30% in how much we make an
attacker pay for hardware. That's a good place to be.

Bill