[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOLP8p4wjSF1Uukaf4EQS3SQ4Nhqr4crgAK-PqLMs_RTjvkWpg@mail.gmail.com>
Date: Thu, 13 Feb 2014 11:06:45 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)
On Thu, Feb 13, 2014 at 10:56 AM, Solar Designer <solar@...nwall.com> wrote:
> Bill,
>
> On Thu, Feb 13, 2014 at 07:47:28PM +0400, Solar Designer wrote:
> I don't know how exactly you were adding SIMD (as an experiment), but
> please note that occasional data mixing between the lanes is a must,
> since otherwise it becomes possible to compute the whole heavy part of
> the KDF as a few sequential portions, one per lane, with the memory
> needs reduced accordingly (it'll be number-of-lanes times smaller).
> (This is similar to how scrypt's p>1 allows for computation with less
> memory when the parallelism is not fully made use of as such.)
>
> For the random access portion of computation, it is also important that
> the random index depends on output of all lanes in the most recent block
> computed. Otherwise some of the lanes may finish computation later.
>
> Alexander
It was just a quick hack. What I did was do 4 parallel lanes of
hashing within a block of size 4096 bytes, so 1024 bytes each. I
passed the same "value" into all 4 lanes, and at the end XOR-ed the 4
results into value. That mixed between lanes between every block,
though in a non-cryptographically strong way.
On the positive side, this increased data throughput about 2X, but it
also had about 2X higher latency per loop iteration, which is about
what I expected. The overall result is about 2X the bandwidth and
about 1/2 the multiplication hardness, for a given runtime. However,
doing 2 threads instead gets maybe 80% higher bandwidth with around
20% lower multiplication hardness for a given runtime. In general, I
don't think there's a way for the SIMD loop to be as fast as the
non-SIMD loop, and this difference will not be seen by an ASIC
attacker, so it comes right out of compute-time hardness.
It's not clear to me what a reasonable tradeoff is. I think 2X
bandwidth for 1/2 multiplication hardness is probably a win, but the 2
thread version does better than that. It might be best just to gain
parallelism through multi-threading rather than SIMD, unless I'm
mistaken about the 2X-ish increase in loop latency.
Bill
Powered by blists - more mailing lists