phc-discussions - Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAOLP8p4wjSF1Uukaf4EQS3SQ4Nhqr4crgAK-PqLMs_RTjvkWpg@mail.gmail.com>
Date: Thu, 13 Feb 2014 11:06:45 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Thu, Feb 13, 2014 at 10:56 AM, Solar Designer <solar@...nwall.com> wrote:
> Bill,
>
> On Thu, Feb 13, 2014 at 07:47:28PM +0400, Solar Designer wrote:
> I don't know how exactly you were adding SIMD (as an experiment), but
> please note that occasional data mixing between the lanes is a must,
> since otherwise it becomes possible to compute the whole heavy part of
> the KDF as a few sequential portions, one per lane, with the memory
> needs reduced accordingly (it'll be number-of-lanes times smaller).
> (This is similar to how scrypt's p>1 allows for computation with less
> memory when the parallelism is not fully made use of as such.)
>
> For the random access portion of computation, it is also important that
> the random index depends on output of all lanes in the most recent block
> computed.  Otherwise some of the lanes may finish computation later.
>
> Alexander

It was just a quick hack.  What I did was do 4 parallel lanes of
hashing within a block of size 4096 bytes, so 1024 bytes each.  I
passed the same "value" into all 4 lanes, and at the end XOR-ed the 4
results into value.  That mixed between lanes between every block,
though in a non-cryptographically strong way.

On the positive side, this increased data throughput about 2X, but it
also had about 2X higher latency per loop iteration, which is about
what I expected.  The overall result is about 2X the bandwidth and
about 1/2 the multiplication hardness, for a given runtime.  However,
doing 2 threads instead gets maybe 80% higher bandwidth with around
20% lower multiplication hardness for a given runtime.  In general, I
don't think there's a way for the SIMD loop to be as fast as the
non-SIMD loop, and this difference will not be seen by an ASIC
attacker, so it comes right out of compute-time hardness.

It's not clear to me what a reasonable tradeoff is.  I think 2X
bandwidth for 1/2 multiplication hardness is probably a win, but the 2
thread version does better than that.  It might be best just to gain
parallelism through multi-threading rather than SIMD, unless I'm
mistaken about the 2X-ish increase in loop latency.

Bill