phc-discussions - Re: [PHC] wider integer multiply on 32-bit x86

Message-ID: <20140304185439.GA10772@openwall.com>
Date: Tue, 4 Mar 2014 22:54:39 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] wider integer multiply on 32-bit x86

On Tue, Mar 04, 2014 at 06:40:40AM +0000, Samuel Neves wrote:
> On 04-03-2014 05:58, Solar Designer wrote:
> > So if I do a MUL, immediately save EDX:EAX to other registers, and
> > follow that with another similar MUL, the second MUL would be able to
> > proceed out-of-order, before the first one completes, correct?  (As long
> > as there's no data dependency between the two MULs, indeed.  Only the
> > same ISA registers used, which a renamer might resolve.)
> 
> Correct. Here's an example where it is clearly visible:
> 
>     mul rbx
>     mov rsi, rax
>     mov rax, rbx
>     mul rbx
>     mov rdi, rax
>     mov rax, rbx
> 
> When put in a loop, this sequence consumes ~3.75 Sandy Bridge cycles per
> iteration. If you remove the 'mov rax, rbx' lines, it grows to ~6.5 (due
> to dependencies). There is still some overhead involved: a perfect loop
> should only require ~2 cycles per iteration. Haswell comes very close,
> at 2.12 (due to register moves also being eliminated at the renaming
> phase). I have no old Pentiums around to check how well renaming works
> there, though.

Thanks.  If you send me a complete test program that I can compile and
run on Linux, I'll test on P2, P3, P4.  (No renaming on P1.)

Alexander
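
The dependency effect discussed above can be approximated in portable C. The following is a minimal sketch, not the test program requested in the message: the names mul_fold, chain_one, chain_two and seconds_now are hypothetical, the compiler rather than the programmer chooses the registers, and the code only contrasts one serial chain of 32x32->64 multiplies with two independent chains that an out-of-order core may overlap. It does not reproduce the exact MUL/MOV sequence or the cycle counts quoted above.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() under strict -std modes */
#include <stdint.h>
#include <time.h>

/* One 32x32->64 widening multiply.  Both halves of the product are
 * folded into the result so that neither half can be discarded. */
static inline uint32_t mul_fold(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * b;
    return (uint32_t)p ^ (uint32_t)(p >> 32);
}

/* Single chain: each multiply needs the result of the previous one,
 * so the loop is bound by multiply latency. */
uint32_t chain_one(uint32_t x, uint32_t m, uint64_t n)
{
    for (uint64_t i = 0; i < n; i++)
        x = mul_fold(x, m) + 1u;
    return x;
}

/* Two interleaved chains with no data dependency between them.
 * A core that renames registers may run the two multiplies of one
 * iteration in parallel, even if both use the same ISA registers. */
uint32_t chain_two(uint32_t x, uint32_t y, uint32_t m, uint64_t n)
{
    for (uint64_t i = 0; i < n; i++) {
        x = mul_fold(x, m) + 1u;
        y = mul_fold(y, m) + 1u;
    }
    return x ^ y;
}

/* Monotonic wall-clock time in seconds, for coarse timing only. */
double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

For a like-for-like comparison, time chain_one() with 2*n iterations against chain_two() with n iterations, so that both perform the same number of multiplies, and use the return values (for example, print them) so the compiler cannot drop the loops. Any speedup seen this way depends on the compiler and the CPU and should be treated as indicative only.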