lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 13 Feb 2014 11:23:29 -0500
From: Bill Cox <>
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Thu, Feb 13, 2014 at 10:57 AM, Samuel Neves <> wrote:
> On 13-02-2014 15:40, Solar Designer wrote:
>> On Wed, Feb 12, 2014 at 06:20:38PM -0500, Bill Cox wrote:
>>> I wrote an SSE 4.1 version of the inner hash loop in NoelKDF, with a
>>> quick hack where I compute 4 separate interleaved hashes, so it gets a
>>> different answer, and has 4 times the parallelism.  I used the fast
>>> packed 32x32->32 multiply available on my Ivy Bridge processor
>>> (_mm_mullo_epi32), which we know will run a lot slower on Haswell.
>>> This instruction isn't even available until SSE 4.1, so that's a
>>> narrow band of processors that support it well.
>> Right, although Samuel Neves posted nice emulation of it via SSE2 in:
>> Date: Sun, 09 Feb 2014 05:47:29 +0000
>> From: Samuel Neves <>
>> To:
>> Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)
> By the way, as an addendum to this, I should point out that the
> performance of running 4 64x64->64 multiplications in parallel in
> Haswell is not significantly worse than VPMULLD.
> The best measured latency for emulated VPMULLD I got (with the SSE2
> version) is 9 Haswell cycles, whereas emulating a 64x64->64-bit
> multiplication (using 3 VPMULUDQ and a few shifts/additions) can be done
> in 11 cycles of latency. Native VPMULLD itself is 10 cycles.

Nice!  I measured the inner loop on my Ivy Bridge processor without
any memory bandwidth limitations, and it does 1 iteration every 4
clocks.  The multiply takes 3 and the add takes 1, so this seems as
good as possible.  Everything else is done in parallel, at least when
running only 1 thread.

The only problem with 9 cycles for VPMULLD or 11 for VPMULUDQ is that
an ASIC attacker will do this same computation in 4-ish cycles.  He
sees no latency increase when adding more parallelism.
Multi-threading seems to offer a way to increase memory bandwidth
without increasing the multiply-loop latency as much as SIMD.
However, the combination increases memory bandwidth more than I can
achieve otherwise, on my machine from 10GB/sec to 13GB/sec.  Memmove
only achives 17GB/sec, so this is pretty good, though 2 threads
achieves 22GB/sec (maybe someone should upgrade memmove to do this by

On the other hand, the SIMD version takes only one CPU, while 2
threads takes 2.  The SIMD version may be better for systems with lots
of users.

I'm still torn as to whether or not to offer the SIMD option.  Higher
memory bandwidth is crucial, but I don't like losing compute time
hardening as a result.  I wish the NSA could weigh in on this and let
us know which would frustrate their password crackers more.  Either
way, I think we're down to the last 30% in how much we make an
attacker pay for hardware.  That's a good place to be.


Powered by blists - more mailing lists