phc-discussions - Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

Message-ID: <20140213193745.GA7078@openwall.com>
Date: Thu, 13 Feb 2014 23:37:45 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Sun, Feb 09, 2014 at 05:47:29AM +0000, Samuel Neves wrote:
> I managed to shave one instruction (cycle) off in two different ways,
> one requiring SSE4.1, the other only SSE2. Both run at same speed.

Cool!

> One note: since we're doing two PMULUDQ, and they cannot be dispatched
> in the same cycle, the lower bound for the latency of any accurate
> emulation is 6 cycles.

Good point.  However, when we have a sequence of multiple emulated
_mm_mullo_epi32()'s, this extra cycle may be incurred only once per
sequence rather than once per _mm_mullo_epi32().
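For reference, here is a minimal sketch of one common SSE2-only way to emulate _mm_mullo_epi32() with two PMULUDQ (_mm_mul_epu32).  It is not necessarily either of the variants Samuel describes, and the names mullo_epi32_sse2, mullo_chain, and chain_matches_scalar are hypothetical helpers made up for this illustration.  mullo_chain is the kind of back-to-back sequence meant above; the code only demonstrates correctness and makes no latency measurement.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: low 32 bits of each 32x32 product, SSE2 only.
 * Uses two PMULUDQ, one for even lanes and one for odd lanes. */
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    /* 64-bit products of lanes 0 and 2 */
    __m128i even = _mm_mul_epu32(a, b);
    /* shift lanes 1 and 3 down, then 64-bit products of those */
    __m128i odd = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                _mm_srli_epi64(b, 32));
    /* move the low 32 bits of each product into elements 0 and 1 */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd = _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0));
    /* interleave back into lane order 0, 1, 2, 3 */
    return _mm_unpacklo_epi32(even, odd);
}

/* Hypothetical helper: n dependent emulated multiplies in a row. */
static inline __m128i mullo_chain(__m128i x, __m128i m, int n)
{
    while (n-- > 0)
        x = mullo_epi32_sse2(x, m);
    return x;
}

/* Returns 1 if the chain agrees with scalar 32-bit multiplication
 * (mod 2^32) in all four lanes, 0 otherwise. */
static int chain_matches_scalar(const uint32_t x[4], const uint32_t m[4],
                                int n)
{
    uint32_t got[4];
    __m128i v = mullo_chain(_mm_loadu_si128((const __m128i *)x),
                            _mm_loadu_si128((const __m128i *)m), n);
    _mm_storeu_si128((__m128i *)got, v);
    for (int i = 0; i < 4; i++) {
        uint32_t want = x[i];
        for (int j = 0; j < n; j++)
            want *= m[i];  /* wraps mod 2^32 */
        if (got[i] != want)
            return 0;
    }
    return 1;
}
```

The two _mm_mul_epu32() calls are independent of each other, which is where the dispatch constraint from the quoted message comes in.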

It's similar to how the scalar widening MUL has (on some CPUs) 1 cycle
higher latency for edx (or rdx) than for eax (or rax).  In the table I
posted, I included the higher latency, but we should keep in mind that
in some instruction sequences the lower latency will apply.

Alexander