Message-ID: <20140304185439.GA10772@openwall.com>
Date: Tue, 4 Mar 2014 22:54:39 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] wider integer multiply on 32-bit x86
On Tue, Mar 04, 2014 at 06:40:40AM +0000, Samuel Neves wrote:
> On 04-03-2014 05:58, Solar Designer wrote:
> > So if I do a MUL, immediately save EDX:EAX to other registers, and
> > follow that with another similar MUL, the second MUL would be able to
> > proceed out-of-order, before the first one completes, correct? (As long
> > as there's no data dependency between the two MULs, indeed. Only the
> > same ISA registers used, which a renamer might resolve.)
>
> Correct. Here's an example where it is clearly visible:
>
> mul rbx
> mov rsi, rax
> mov rax, rbx
> mul rbx
> mov rdi, rax
> mov rax, rbx
>
> When put in a loop, this sequence consumes ~3.75 Sandy Bridge cycles per
> iteration. If you remove the 'mov rax, rbx' lines, it grows to ~6.5 (due
> to dependencies). There is still some overhead involved: a perfect loop
> should only require ~2 cycles per iteration. Haswell comes very close,
> at 2.12 (due to register moves also being eliminated at the renaming
> phase). I have no old Pentiums around to check how well renaming works
> there, though.
Thanks. If you send me a complete test program that I can compile and
run on Linux, I'll test on P2, P3, P4. (No renaming on P1.)
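For reference, a minimal sketch of what such a test program might look like. This is not Samuel's actual code. It assumes GCC or Clang on x86 Linux, and the names mul_independent, mul_dependent and run_mul_test are made up for illustration. It uses the 32-bit MUL (EDX:EAX) rather than the 64-bit registers from the quoted example, so that it can also be built with -m32 for P2/P3/P4.

```c
/* Sketch only: assumes GCC or Clang on x86 (build with -O2; add -m32 for 32-bit). */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Read the time-stamp counter.  On newer CPUs it may tick at a fixed
   reference rate rather than the core clock, so compare results relatively. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Two MULs per iteration.  EAX is reloaded from x after each MUL, so the
   MULs reuse the same ISA registers but have no data dependency.
   n must be greater than 0.  Returns the low 32 bits of x * x. */
static uint32_t mul_independent(uint32_t x, uint32_t n)
{
    uint32_t a = x, r1, r2;
    __asm__ volatile(
        "1:\n\t"
        "mull %[b]\n\t"
        "movl %%eax, %[r1]\n\t"
        "movl %[b], %%eax\n\t"
        "mull %[b]\n\t"
        "movl %%eax, %[r2]\n\t"
        "movl %[b], %%eax\n\t"
        "decl %[n]\n\t"
        "jnz 1b"
        : "+a"(a), [r1] "=&r"(r1), [r2] "=&r"(r2), [n] "+r"(n)
        : [b] "r"(x)
        : "edx", "cc");
    (void)r1;
    return r2;
}

/* Same loop without the reloads: each MUL consumes the EAX produced by
   the previous one, forming a dependency chain.  n must be greater than 0.
   Returns the low 32 bits of x^(2n+1). */
static uint32_t mul_dependent(uint32_t x, uint32_t n)
{
    uint32_t a = x, r1, r2;
    __asm__ volatile(
        "1:\n\t"
        "mull %[b]\n\t"
        "movl %%eax, %[r1]\n\t"
        "mull %[b]\n\t"
        "movl %%eax, %[r2]\n\t"
        "decl %[n]\n\t"
        "jnz 1b"
        : "+a"(a), [r1] "=&r"(r1), [r2] "=&r"(r2), [n] "+r"(n)
        : [b] "r"(x)
        : "edx", "cc");
    (void)r1;
    return r2;
}

/* Time both loops and print TSC ticks per iteration (two MULs each). */
static void run_mul_test(uint32_t iters)
{
    uint64_t t0 = rdtsc();
    uint32_t ri = mul_independent(3, iters);
    uint64_t t1 = rdtsc();
    uint32_t rd = mul_dependent(3, iters);
    uint64_t t2 = rdtsc();
    assert(ri == 9);   /* sanity check: low 32 bits of 3 * 3 */
    (void)rd;
    printf("independent: %.2f TSC ticks/iteration\n", (double)(t1 - t0) / iters);
    printf("dependent:   %.2f TSC ticks/iteration\n", (double)(t2 - t1) / iters);
}
```

To run it, add a main() that calls something like run_mul_test(100000000) and compare the two figures. The absolute numbers include loop overhead and depend on how the TSC relates to the core clock on the CPU in question, so only the ratio between the two loops is meaningful.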
Alexander