lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 4 Feb 2014 09:22:22 -0500
From: Bill Cox <>
Subject: Re: [PHC] FMA (Re: [PHC] Initial (non-proof-read) NeolKDF paper)

On Mon, Feb 3, 2014 at 7:14 PM, Solar Designer <> wrote:
> Bill,
> On Sun, Jan 26, 2014 at 12:43:33AM -0500, Bill Cox wrote:
>>     value = value*(mem[prevAddr++] | 3) + mem[randAddr++]
> Note that while this could be a fused multiply-add (FMA) operation on
> architectures that have one, your choice for which of the three inputs
> to replace makes it incompatible with some architectures that do have
> FMA - e.g., with Epiphany, and probably with more.  AMD addresses this
> by supporting a 4-operand FMA (making the output separate from the 3
> inputs), Intel addresses this by providing multiple forms of 3-operand
> FMA (they're currently floating-point only, though), but some archs
> don't address this issue (at least Epiphany does not, and I expect many
> more examples may be found if we look for them).  If we want to make
> this FMA-friendly, then it'd best to replace the same output that is
> replaced when doing matrix multiplication, since this is what all FMA
> implementations will support.  e.g. this would work:
>         value += mem[prevAddr++] | 3) * mem[randAddr++];
> but perhaps it does not meet your other requirements, or does it?

I should have guessed that modern hardware continues to implement the
multiply-accumulate primitive efficiently.  I'm testing this modified
hash function now.  How pissed will the PHC be if I keep re-submitting
improved version?  Can I blame you :-)

It's already passed some of the harder dieharder tests, so it looks
like it will be good enough.  For coding this up for SIMD, I'm
thinking of doing an 4-way memory interleaved version so that all the
data in the 1st loop lines up nicely as 128-bit memory accesses.  Is
this the right approach for making use of SIMD?


Powered by blists - more mailing lists