Message-ID: <20140421203531.GA29289@openwall.com>
Date: Tue, 22 Apr 2014 00:35:31 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Thu, Feb 13, 2014 at 08:46:01PM +0400, Solar Designer wrote:
> On Thu, Feb 13, 2014 at 11:06:45AM -0500, Bill Cox wrote:
> > In general, I
> > don't think there's a way for the SIMD loop to be as fast as the
> > non-SIMD loop, and this difference will not be seen by an ASIC
> > attacker, so it comes right out of compute-time hardness.
>
> This is about right, but there appear to be ways around it:
>
> 1. If we do 32x32->64, then non-SIMD latency is 4 to 5 cycles (for the
> upper 32 bits of result; the lower 32 bits may be ready in "eax" after 3
> cycles). The PMULUDQ latency is 3 to 5 cycles - potentially even 1
> cycle better than the best CPU's non-SIMD equivalent (for the upper
> 32 bits of the result). See GenuineIntel0010676_Harpertown_InstLatX64.txt
> for where PMULUDQ wins over scalar (3 cycles vs. 5 cycles latency).

I just found that AuthenticAMD0700F01_K16_Kabini_InstLatX86.txt (AMD
A4-5000 APU at 1.5 GHz) lists "PMULUDQ xmm, xmm" and "VPMULUDQ xmm, xmm,
xmm" as having 2 cycles latency. While this improvement over 3 cycles
(best across other CPUs) is almost certainly due to the lower clock
rate, it's still impressive. This same file lists scalar *MUL as 3+
cycles latency, and 4 cycles for the high 32 bits. So the SIMD form of
32x32->64 is twice as fast as scalar (in terms of latency) on this CPU
for the high 32 bits of the result.
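
For reference, here's roughly what the two forms being compared look
like in C (an untested sketch; the function names are just
illustrative):

#include <stdint.h>
#include <emmintrin.h> /* SSE2: _mm_mul_epu32 = PMULUDQ */

/* Scalar 32x32->64: the high 32 bits arrive a cycle or two after
 * the low 32 bits */
static inline uint32_t mul_hi_scalar(uint32_t a, uint32_t b)
{
	return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* SIMD 32x32->64: PMULUDQ computes the full 64-bit products of the
 * even 32-bit lanes, so the high halves cost no extra latency */
static inline __m128i mul_wide_simd(__m128i a, __m128i b)
{
	return _mm_mul_epu32(a, b);
}
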
> 2. We may do a non-SIMD chain for latency hardening in parallel with a
> bunch of SIMD chains optimized for throughput (to use the multipliers'
> die area optimally and not leave room for attacker to share most of the
> multipliers between cores). Then even PMULLD on Haswell will be fine
> (not on Avoton, though, where it also has much lower throughput
> compared to e.g. PMULUDQ). By having multiple chains work "in
> parallel" I mean
> use of interleaved instructions (ALU intermixed with SIMD). Their
> results would need to be mixed together once in a while, such as after
> each block.
>
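
A rough, untested sketch of what such interleaving could look like
(two SIMD chains shown; the function name and the per-block mixing
step are made up for illustration):

#include <stdint.h>
#include <emmintrin.h>

/* One scalar multiply chain for latency hardening, interleaved with
 * independent SIMD chains that keep the multipliers busy for
 * throughput.  The chains are independent within a block, so the CPU
 * can issue the ALU and SIMD instructions in parallel. */
static void block_rounds(uint32_t *x, __m128i y[2],
    const uint32_t *in, const __m128i *vin, unsigned int n)
{
	uint32_t s = *x;
	__m128i v0 = y[0], v1 = y[1];
	unsigned int i;

	for (i = 0; i < n; i++) {
		/* scalar chain: serial multiplies, latency-bound;
		 * the |1 keeps the chain from getting stuck at zero */
		s = s * (in[i] | 1) + 1;
		/* SIMD chains: independent, throughput-bound */
		v0 = _mm_add_epi64(v0, _mm_mul_epu32(v0, vin[i]));
		v1 = _mm_add_epi64(v1, _mm_mul_epu32(v1, vin[i]));
	}

	/* mix the chains' results together once per block */
	s ^= (uint32_t)_mm_cvtsi128_si32(v0) ^
	    (uint32_t)_mm_cvtsi128_si32(v1);
	*x = s;
	y[0] = _mm_xor_si128(v0, _mm_cvtsi32_si128((int)s));
	y[1] = v1;
}
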
> Also, accessing memory via SIMD instructions improves bandwidth, and
> this can only be done with no overhead if the computation is SIMD.
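
As a trivial untested example of the latter point, a SIMD computing
chain can consume a block with full-width loads at no extra cost,
whereas a scalar chain would first have to move the data between
vector registers and GPRs (the function name is made up):

#include <emmintrin.h>

/* XOR a 16-byte-aligned memory block into SIMD state using
 * full-width loads */
static void absorb_block(__m128i *state, const __m128i *block,
    unsigned int words)
{
	unsigned int i;

	for (i = 0; i < words; i++)
		state[i] = _mm_xor_si128(state[i],
		    _mm_load_si128(&block[i]));
}
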
Alexander