Message-ID: <CAOLP8p5OFriMbd3n8XFh0xpZaRqhMHFxa6mLHZLYdssg9qq0sg@mail.gmail.com>
Date: Tue, 4 Feb 2014 09:22:22 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] FMA (Re: [PHC] Initial (non-proof-read) NeolKDF paper)

On Mon, Feb 3, 2014 at 7:14 PM, Solar Designer <solar@...nwall.com> wrote:
> Bill,
>
> On Sun, Jan 26, 2014 at 12:43:33AM -0500, Bill Cox wrote:
>> value = value*(mem[prevAddr++] | 3) + mem[randAddr++]
>
> Note that while this could be a fused multiply-add (FMA) operation on
> architectures that have one, your choice for which of the three inputs
> to replace makes it incompatible with some architectures that do have
> FMA - e.g., with Epiphany, and probably with more. AMD addresses this
> by supporting a 4-operand FMA (making the output separate from the 3
> inputs), Intel addresses this by providing multiple forms of 3-operand
> FMA (they're currently floating-point only, though), but some archs
> don't address this issue (at least Epiphany does not, and I expect many
> more examples may be found if we look for them). If we want to make
> this FMA-friendly, then it'd be best to replace the same output that is
> replaced when doing matrix multiplication, since this is what all FMA
> implementations will support. e.g. this would work:
>
> value += (mem[prevAddr++] | 3) * mem[randAddr++];
>
> but perhaps it does not meet your other requirements, or does it?

I should have guessed that modern hardware continues to implement the
multiply-accumulate primitive efficiently. I'm testing this modified
hash function now. How pissed will the PHC be if I keep re-submitting
improved versions? Can I blame you :-)

It's already passed some of the harder dieharder tests, so it looks
like it will be good enough.

For coding this up for SIMD, I'm thinking of doing a 4-way
memory-interleaved version so that all the data in the 1st loop lines
up nicely as 128-bit memory accesses. Is this the right approach for
making use of SIMD?

Thanks,
Bill
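
For reference, here is a rough C sketch of the two update forms discussed
above, plus one possible reading of the 4-way interleaved layout. It is only
an illustration under assumptions not stated in this thread: 32-bit unsigned
words (so four lanes fill one 128-bit access), wrap-around arithmetic mod
2^32, and hypothetical function names and loop bounds. It is not the actual
NoelKDF code.

```c
#include <stddef.h>
#include <stdint.h>

/* Original form: the running value is one of the multiplicands, so the
 * instruction must overwrite a multiplier input with the result. */
static uint32_t hash_mul_first(const uint32_t *mem, size_t prevAddr,
                               size_t randAddr, size_t n, uint32_t value)
{
    for (size_t i = 0; i < n; i++)
        value = value * (mem[prevAddr + i] | 3) + mem[randAddr + i];
    return value;
}

/* Multiply-accumulate form: the running value is the addend, which is
 * the operand a matrix-multiply style FMA overwrites. */
static uint32_t hash_mac(const uint32_t *mem, size_t prevAddr,
                         size_t randAddr, size_t n, uint32_t value)
{
    for (size_t i = 0; i < n; i++)
        value += (mem[prevAddr + i] | 3) * mem[randAddr + i];
    return value;
}

/* Assumed 4-way interleave: four independent accumulators, where lane j
 * consumes words 4*i + j, so each step reads four adjacent 32-bit words
 * (one 128-bit access) from each stream. Written as plain C so that a
 * compiler may vectorize the inner loop. */
static void hash_mac_x4(const uint32_t *mem, size_t prevAddr,
                        size_t randAddr, size_t n, uint32_t value[4])
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < 4; j++)
            value[j] += (mem[prevAddr + 4 * i + j] | 3)
                        * mem[randAddr + 4 * i + j];
}
```

Note that the two scalar forms are different functions: in the first, the
previous value is scaled by each multiplier, while in the second it is only
added to, so the mixing properties would need to be re-checked separately.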