Message-ID: <20140421203531.GA29289@openwall.com>
Date: Tue, 22 Apr 2014 00:35:31 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Thu, Feb 13, 2014 at 08:46:01PM +0400, Solar Designer wrote:
> On Thu, Feb 13, 2014 at 11:06:45AM -0500, Bill Cox wrote:
> > In general, I
> > don't think there's a way for the SIMD loop to be as fast as the
> > non-SIMD loop, and this difference will not be seen by an ASIC
> > attacker, so it comes right out of compute-time hardness.
>
> This is about right, but there appear to be ways around it:
>
> 1. If we do 32x32->64, then non-SIMD latency is 4 to 5 cycles (for the
> upper 32 bits of result; the lower 32 bits may be ready in "eax" after 3
> cycles). The PMULUDQ latency is 3 to 5 cycles - potentially even 1
> cycle better than the best CPU's non-SIMD equivalent (for the upper
> 32 bits of the result). See GenuineIntel0010676_Harpertown_InstLatX64.txt
> for where PMULUDQ wins over scalar (3 cycles vs. 5 cycles latency).

I just found that AuthenticAMD0700F01_K16_Kabini_InstLatX86.txt (AMD
A4-5000 APU at 1.5 GHz) lists "PMULUDQ xmm, xmm" and "VPMULUDQ xmm, xmm,
xmm" as having 2 cycles latency. While this improvement over 3 cycles
(best across other CPUs) is almost certainly due to the lower clock
rate, it's still impressive. This same file lists scalar *MUL as 3+
cycles latency, and 4 cycles for the high 32 bits. So the SIMD form of
32x32->64 is twice as fast as scalar (in terms of latency) on this CPU
for the high 32 bits of the result.
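
For reference, here's roughly what the two forms being compared look
like in C (an untested sketch; the function names are just
illustrative):

#include <stdint.h>
#include <emmintrin.h> /* SSE2: _mm_mul_epu32 = PMULUDQ */

/* Scalar 32x32->64: the high 32 bits arrive a cycle or two after
 * the low 32 bits */
static inline uint32_t mul_hi_scalar(uint32_t a, uint32_t b)
{
	return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* SIMD 32x32->64: PMULUDQ computes the full 64-bit products of the
 * even 32-bit lanes, so the high halves cost no extra latency */
static inline __m128i mul_wide_simd(__m128i a, __m128i b)
{
	return _mm_mul_epu32(a, b);
}
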
> 2. We may do a non-SIMD chain for latency hardening in parallel with a
> bunch of SIMD chains optimized for throughput (to use the multipliers'
> die area optimally and not leave room for attacker to share most of the
> multipliers between cores). Then even PMULLD on Haswell will be fine
> (not on Avoton, though, where it also has much lower throughput
> compared to e.g. PMULUDQ). By having multiple chains work "in
> parallel" I mean
> use of interleaved instructions (ALU intermixed with SIMD). Their
> results would need to be mixed together once in a while, such as after
> each block.
>
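
A rough, untested sketch of what such interleaving could look like
(two SIMD chains shown; the function name and the per-block mixing
step are made up for illustration):

#include <stdint.h>
#include <emmintrin.h>

/* One scalar multiply chain for latency hardening, interleaved with
 * independent SIMD chains that keep the multipliers busy for
 * throughput.  The chains are independent within a block, so the CPU
 * can issue the ALU and SIMD instructions in parallel. */
static void block_rounds(uint32_t *x, __m128i y[2],
    const uint32_t *in, const __m128i *vin, unsigned int n)
{
	uint32_t s = *x;
	__m128i v0 = y[0], v1 = y[1];
	unsigned int i;

	for (i = 0; i < n; i++) {
		/* scalar chain: serial multiplies, latency-bound;
		 * the |1 keeps the chain from getting stuck at zero */
		s = s * (in[i] | 1) + 1;
		/* SIMD chains: independent, throughput-bound */
		v0 = _mm_add_epi64(v0, _mm_mul_epu32(v0, vin[i]));
		v1 = _mm_add_epi64(v1, _mm_mul_epu32(v1, vin[i]));
	}

	/* mix the chains' results together once per block */
	s ^= (uint32_t)_mm_cvtsi128_si32(v0) ^
	    (uint32_t)_mm_cvtsi128_si32(v1);
	*x = s;
	y[0] = _mm_xor_si128(v0, _mm_cvtsi32_si128((int)s));
	y[1] = v1;
}
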
> Also, accessing memory via SIMD instructions improves bandwidth, and
> this can only be done with no overhead if the computation is SIMD.
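
As a trivial untested example of the latter point, a SIMD computing
chain can consume a block with full-width loads at no extra cost,
whereas a scalar chain would first have to move the data between
vector registers and GPRs (the function name is made up):

#include <emmintrin.h>

/* XOR a 16-byte-aligned memory block into SIMD state using
 * full-width loads */
static void absorb_block(__m128i *state, const __m128i *block,
    unsigned int words)
{
	unsigned int i;

	for (i = 0; i < words; i++)
		state[i] = _mm_xor_si128(state[i],
		    _mm_load_si128(&block[i]));
}
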
Alexander