[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <531652C8.7050308@dei.uc.pt>
Date: Tue, 04 Mar 2014 22:25:12 +0000
From: Samuel Neves <sneves@....uc.pt>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] wider integer multiply on 32-bit x86
On 04-03-2014 13:45, Steve Thomas wrote:
> > On March 3, 2014 at 10:04 PM Samuel Neves <sneves@....uc.pt> wrote:
> >
> > static const uint16_t ctrl = 0x1f7f; // 64-bit mantissa, round to 0
>
> Nice with that you can do umul64_hi with one fmul and one fscale:
>
> #define __STDC_CONSTANT_MACROS
> #include <inttypes.h>
>
> uint64_t umul64_hi(uint64_t a, uint64_t b)
> {
> uint64_t r;
> float bit63 = 9223372036854775808.0; // 2^63
> int32_t shift = -64;
> uint16_t ctrl = 0x1f7f;
>
> a -= UINT64_C(0x8000000000000000);
> b -= UINT64_C(0x8000000000000000);
> asm(
> "fldcw %4 \n\t"
> "fildl %3 \n\t"
> "fildll %1 \n\t"
> "fadd %5 \n\t"
> "fildll %2 \n\t"
> "fadd %5 \n\t"
> "fmulp \n\t"
> "fscale \n\t"
> "fsub %5 \n\t"
> "fistpll %0 \n\t"
> "ffree %%st\n\t"
> : "=m"(r)
> : "m"(a), "m"(b), "m"(shift), "m"(ctrl), "m"(bit63));
> return r + UINT64_C(0x8000000000000000);
> }
>
> A few notes on the code fildll loads a signed 64 bit int so you need to
> subtract 2^63 and load then add 2.0^63 to make it an unsigned load and
> the
> reverse on the store. ffree is needed because after 6 calls the stack
> is full
> or something and gives bad data. Lastly I have never done x87 before
> today.
> So there's probably a better way to do it. Also this stack thing I
> don't get
> it besides: there are 8 registers, it's a cyclical stack, you push and
> pop,
> it breaks if you push to much instead of overwriting oldest, there's a
> notion
> of free that isn't a pop but works like it?, and you can change the stack
> pointer by +-1 and somehow doesn't make it implode.
FSCALE is nice, but it is a microcoded instruction which has something
like 40 cycles of latency on Pentium III/M, which sadly makes it much
worse than FMUL. For your 64-bit (instead of 63-bit) multiplication to
work, we need to adjust the rounding mode towards minus infinity,
instead of truncating. Here's a slightly tweaked version that (I hope)
is correct:
static inline uint64_t fp_mul64_hi(uint64_t a, uint64_t b)
{
static const double twom64 =
5.4210108624275221700372640043497085571289062500000E-20;
static const uint16_t ctrl = 0x077f; // 64-bit mantissa, round
to -infinity
uint64_t high, sum = 0;
__asm__ __volatile__
(
"fldcw %4\n\t" // setuprounding mode
"fildll %1\n\t"
"fildll %2\n\t"
"fmulp \n\t"
"fmull %3\n\t"
"fistpll %0\n\t"
: "=m" (high)
: "m" (a), "m" (b), "m" (twom64), "m" (ctrl)
);
#if defined(CONSTANT_TIME)// Computation of sum can run interleaved
with FP
sum = (a & -(b >> 63)) + (b & -(a >> 63));
#else
if(a & (1ULL << 63)) sum += b;
if(b & (1ULL << 63)) sum += a;
#endif
return high + sum;
}
Content of type "text/html" skipped
Powered by blists - more mailing lists