Date: Thu, 13 Feb 2014 11:06:45 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Thu, Feb 13, 2014 at 10:56 AM, Solar Designer <solar@...nwall.com> wrote:
> Bill,
>
> On Thu, Feb 13, 2014 at 07:47:28PM +0400, Solar Designer wrote:
> I don't know how exactly you were adding SIMD (as an experiment), but
> please note that occasional data mixing between the lanes is a must,
> since otherwise it becomes possible to compute the whole heavy part of
> the KDF as a few sequential portions, one per lane, with the memory
> needs reduced accordingly (it'll be number-of-lanes times smaller).
> (This is similar to how scrypt's p>1 allows for computation with less
> memory when the parallelism is not fully made use of as such.)
>
> For the random access portion of computation, it is also important that
> the random index depends on output of all lanes in the most recent block
> computed.  Otherwise some of the lanes may finish computation later.
>
> Alexander

It was just a quick hack.  I ran 4 parallel lanes of hashing within
each 4096-byte block, so 1024 bytes per lane.  I passed the same
"value" into all 4 lanes, and at the end of the block XOR-ed the 4
lane results back into value.  That mixed data between the lanes at
every block boundary, though not in a cryptographically strong way.
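
To make that concrete, here's roughly what the hack looked like in C.
This is only a sketch: the multiply-add step, the interleaved lane
layout, and the names (hash_block, BLOCK_BYTES, etc.) are placeholders
I'm making up here, not NoelKDF's actual inner loop.

/* Rough sketch of the "quick hack" above -- NOT NoelKDF's real code.
 * The multiply-add step stands in for the real multiplication-hardened
 * hash; the layout and names are illustrative only. */
#include <stdint.h>
#include <stddef.h>

#define BLOCK_BYTES 4096
#define LANES 4
#define LANE_WORDS (BLOCK_BYTES / LANES / sizeof(uint32_t)) /* 1024 bytes/lane */

static uint32_t hash_block(uint32_t value, const uint32_t *prev, uint32_t *out)
{
    uint32_t lane_val[LANES];

    /* Each lane starts from the same "value". */
    for (int l = 0; l < LANES; l++)
        lane_val[l] = value;

    /* Hash the 4 lanes independently -- this is the part that maps onto SIMD. */
    for (size_t i = 0; i < LANE_WORDS; i++) {
        for (int l = 0; l < LANES; l++) {
            size_t j = i * LANES + l;                /* interleaved lane layout */
            lane_val[l] = lane_val[l] * (prev[j] | 3) + (uint32_t)j; /* placeholder step */
            out[j] = lane_val[l];
        }
    }

    /* XOR the 4 lane results back into "value" at the block boundary, so the
     * next block (and any random index derived from value) depends on every
     * lane, along the lines Alexander suggests. */
    for (int l = 0; l < LANES; l++)
        value ^= lane_val[l];
    return value;
}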

On the positive side, this increased data throughput by about 2X, but
it also had about 2X higher latency per loop iteration, which is about
what I expected.  The overall result is about 2X the bandwidth and
about 1/2 the multiplication hardness for a given runtime.  However,
doing 2 threads instead gets maybe 80% higher bandwidth with around
20% lower multiplication hardness for a given runtime.  In general, I
don't think there's a way for the SIMD loop to be as fast per
iteration as the non-SIMD loop, and an ASIC attacker won't pay that
extra latency, so it comes straight out of compute-time hardness.

It's not clear to me what a reasonable tradeoff is.  I think 2X
bandwidth for 1/2 the multiplication hardness is probably a win, but
the 2-thread version does better than that.  It might be best to gain
parallelism through multi-threading rather than SIMD, unless I'm
mistaken about the 2X-ish increase in loop latency.
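
For comparison, the 2-thread approach I have in mind is roughly the
sketch below, where each thread hashes its own region of memory with
the ordinary scalar loop.  Again, fill_region() and the job struct are
stand-ins I'm inventing for illustration, not the real NoelKDF code,
and the per-thread results would still need to be combined at the end.

/* Illustrative only: two threads each hashing their own memory region,
 * instead of SIMD lanes within one block. */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

struct job {
    uint32_t *mem;   /* this thread's region */
    size_t    words; /* region size in 32-bit words */
    uint32_t  seed;  /* per-thread starting value */
};

/* Placeholder for the real scalar (non-SIMD) hashing loop. */
static void fill_region(uint32_t *mem, size_t words, uint32_t seed)
{
    uint32_t v = seed | 1;
    for (size_t i = 0; i < words; i++) {
        v = v * (v ^ (uint32_t)i) + 12345u;  /* placeholder, not the real hash */
        mem[i] = v;
    }
}

static void *worker(void *arg)
{
    struct job *j = arg;
    fill_region(j->mem, j->words, j->seed);
    return NULL;
}

int hash_two_threads(uint32_t *mem, size_t words, uint32_t seed)
{
    pthread_t t[2];
    struct job jobs[2] = {
        { mem,             words / 2, seed     },
        { mem + words / 2, words / 2, seed + 1 },
    };
    for (int i = 0; i < 2; i++)
        if (pthread_create(&t[i], NULL, worker, &jobs[i]))
            return -1;
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    /* The two regions' results still have to be folded into the final hash. */
    return 0;
}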

Bill
