Date: Tue, 22 Apr 2014 00:35:31 +0400
From: Solar Designer <>
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Thu, Feb 13, 2014 at 08:46:01PM +0400, Solar Designer wrote:
> On Thu, Feb 13, 2014 at 11:06:45AM -0500, Bill Cox wrote:
> > In general, I
> > don't think there's a way for the SIMD loop to be as fast as the
> > non-SIMD loop, and this difference will not be seen by an ASIC
> > attacker, so it comes right out of compute-time hardness.
> This is about right, but there appear to be ways around it:
> 1. If we do 32x32->64, then non-SIMD latency is 4 to 5 cycles (for the
> upper 32 bits of result; the lower 32 bits may be ready in "eax" after 3
> cycles).  The PMULUDQ latency is 3 to 5 cycles - potentially even 1
> cycle better than best CPU's non-SIMD equivalent (for upper 32 bits of
> result).  See GenuineIntel0010676_Harpertown_InstLatX64.txt for where
> PMULUDQ wins over scalar (3 cycles vs. 5 cycles latency).

I just found that AuthenticAMD0700F01_K16_Kabini_InstLatX86.txt (AMD
A4-5000 APU at 1.5 GHz) lists "PMULUDQ xmm, xmm" and "VPMULUDQ xmm, xmm,
xmm" as having 2 cycles latency.  While this improvement over 3 cycles
(best across other CPUs) is almost certainly due to the lower clock
rate, it's still impressive.  This same file lists scalar *MUL as 3+
cycles latency, and 4 cycles for high 32 bits.  So the SIMD form of
32x32->64 is twice as fast as scalar (in terms of latency) on this CPU
for high 32 bits of result.

> 2. We may do a non-SIMD chain for latency hardening in parallel with a
> bunch of SIMD chains optimized for throughput (to use the multipliers'
> die area optimally and not leave room for attacker to share most of the
> multipliers between cores).  Then even PMULLD on Haswell will be fine
> (not on Avoton, though, where it's also much lower throughput, compared
> to e.g. PMULUDQ).  By having multiple chains work "in parallel" I mean
> use of interleaved instructions (ALU intermixed with SIMD).  Their
> results would need to be mixed together once in a while, such as after
> each block.
> Also, accessing memory via SIMD instructions improves bandwidth, and
> this can only be done with no overhead if the computation is SIMD.

