lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Thu, 13 Feb 2014 11:06:45 -0500 From: Bill Cox <waywardgeek@...il.com> To: discussions@...sword-hashing.net Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission) On Thu, Feb 13, 2014 at 10:56 AM, Solar Designer <solar@...nwall.com> wrote: > Bill, > > On Thu, Feb 13, 2014 at 07:47:28PM +0400, Solar Designer wrote: > I don't know how exactly you were adding SIMD (as an experiment), but > please note that occasional data mixing between the lanes is a must, > since otherwise it becomes possible to compute the whole heavy part of > the KDF as a few sequential portions, one per lane, with the memory > needs reduced accordingly (it'll be number-of-lanes times smaller). > (This is similar to how scrypt's p>1 allows for computation with less > memory when the parallelism is not fully made use of as such.) > > For the random access portion of computation, it is also important that > the random index depends on output of all lanes in the most recent block > computed. Otherwise some of the lanes may finish computation later. > > Alexander It was just a quick hack. What I did was do 4 parallel lanes of hashing within a block of size 4096 bytes, so 1024 bytes each. I passed the same "value" into all 4 lanes, and at the end XOR-ed the 4 results into value. That mixed between lanes between every block, though in a non-cryptographically strong way. On the positive side, this increased data throughput about 2X, but it also had about 2X higher latency per loop iteration, which is about what I expected. The overall result is about 2X the bandwidth and about 1/2 the multiplication hardness, for a given runtime. However, doing 2 threads instead gets maybe 80% higher bandwidth with around 20% lower multiplication hardness for a given runtime. In general, I don't think there's a way for the SIMD loop to be as fast as the non-SIMD loop, and this difference will not be seen by an ASIC attacker, so it comes right out of compute-time hardness. It's not clear to me what a reasonable tradeoff is. I think 2X bandwidth for 1/2 multiplication hardness is probably a win, but the 2 thread version does better than that. It might be best just to gain parallelism through multi-threading rather than SIMD, unless I'm mistaken about the 2X-ish increase in loop latency. Bill
Powered by blists - more mailing lists