phc-discussions - FMA (Re: [PHC] Initial (non-proof-read) NeolKDF paper)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140204001406.GA4574@openwall.com>
Date: Tue, 4 Feb 2014 04:14:06 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: FMA (Re: [PHC] Initial (non-proof-read) NeolKDF paper)

Bill,

On Sun, Jan 26, 2014 at 12:43:33AM -0500, Bill Cox wrote:
>     value = value*(mem[prevAddr++] | 3) + mem[randAddr++]

Note that while this could be a fused multiply-add (FMA) operation on
architectures that have one, your choice for which of the three inputs
to replace makes it incompatible with some architectures that do have
FMA - e.g., with Epiphany, and probably with more.  AMD addresses this
by supporting a 4-operand FMA (making the output separate from the 3
inputs), Intel addresses this by providing multiple forms of 3-operand
FMA (they're currently floating-point only, though), but some archs
don't address this issue (at least Epiphany does not, and I expect many
more examples may be found if we look for them).  If we want to make
this FMA-friendly, then it'd best to replace the same output that is
replaced when doing matrix multiplication, since this is what all FMA
implementations will support.  e.g. this would work:

	value += mem[prevAddr++] | 3) * mem[randAddr++];

but perhaps it does not meet your other requirements, or does it?

On the other hand, one of the operands to FMA is being read from memory
(the other has to pass through OR first, so is in a register by the time
we reach FMA).  If the arch lacks load-op instructions (or at least
lacks such FMA), there will be load instructions into registers, which
may in turn mean that your original hash function is just as optimal: a
temporarily used register holding mem[randAddr] may be getting updated.
It could be something like:

	value2 = mem[randAddr++];
	value2 += value1*(mem[prevAddr++] | 3);
	mem[toAddr++] = value2;
	value1 = mem[randAddr++];
	value1 += value2*(mem[prevAddr++] | 3);
	mem[toAddr++] = value1;

in 2x unrolled loop.  Then it fits canonical FMA as-is.

So the "problem" applies only to archs that have only 3-operand FMA (not
4-operand), don't have an instruction form that would update one of the
factors (would be a non-canonical FMA since it is not one needed for
matrix multiplication), yet do support reading one of the factors from
memory in this FMA instruction.  I'd guess such archs do exist, but I
admit the "problem" is even more limited than I had thought when I
started writing this message. ;-)

Alexander