lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 13 Feb 2014 20:46:02 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Thu, Feb 13, 2014 at 11:06:45AM -0500, Bill Cox wrote:
> In general, I
> don't think there's a way for the SIMD loop to be as fast as the
> non-SIMD loop, and this difference will not be seen by an ASIC
> attacker, so it comes right out of compute-time hardness.

This is about right, but there appear to be ways around it:

1. If we do 32x32->64, then non-SIMD latency is 4 to 5 cycles (for the
upper 32 bits of result; the lower 32 bits may be ready in "eax" after 3
cycles).  The PMULUDQ latency is 3 to 5 cycles - potentially even 1
cycle better than best CPU's non-SIMD equivalent (for upper 32 bits of
result).  See GenuineIntel0010676_Harpertown_InstLatX64.txt for where
PMULUDQ wins over scalar (3 cycles vs. 5 cycles latency).

2. We may do a non-SIMD chain for latency hardening in parallel with a
bunch of SIMD chains optimized for throughput (to use the multipliers'
die area optimally and not leave room for attacker to share most of the
multipliers between cores).  Then even PMULLD on Haswell will be fine
(not on Avoton, though, where it's also much lower throughput, compared
to e.g. PMULUDQ).  By having multiple chains work "in parallel" I mean
use of interleaved instructions (ALU intermixed with SIMD).  Their
results would need to be mixed together once in a while, such as after
each block.

Also, accessing memory via SIMD instructions improves bandwidth, and
this can only be done with no overhead if the computation is SIMD.

Alexander

Powered by blists - more mailing lists