phc-discussions - Re: [PHC] wider integer multiply on 32-bit x86

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <531652C8.7050308@dei.uc.pt>
Date: Tue, 04 Mar 2014 22:25:12 +0000
From: Samuel Neves <sneves@....uc.pt>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] wider integer multiply on 32-bit x86

On 04-03-2014 13:45, Steve Thomas wrote:
> > On March 3, 2014 at 10:04 PM Samuel Neves <sneves@....uc.pt> wrote:
> >
> > static const uint16_t ctrl = 0x1f7f; // 64-bit mantissa, round to 0
>
> Nice with that you can do umul64_hi with one fmul and one fscale:
>
> #define __STDC_CONSTANT_MACROS
> #include <inttypes.h>
>
> uint64_t umul64_hi(uint64_t a, uint64_t b)
> {
>     uint64_t r;
>     float    bit63 = 9223372036854775808.0; // 2^63
>     int32_t  shift = -64;
>     uint16_t ctrl  = 0x1f7f;
>
>     a -= UINT64_C(0x8000000000000000);
>     b -= UINT64_C(0x8000000000000000);
>     asm(
>         "fldcw   %4  \n\t"
>         "fildl   %3  \n\t"
>         "fildll  %1  \n\t"
>         "fadd    %5  \n\t"
>         "fildll  %2  \n\t"
>         "fadd    %5  \n\t"
>         "fmulp       \n\t"
>         "fscale      \n\t"
>         "fsub    %5  \n\t"
>         "fistpll %0  \n\t"
>         "ffree   %%st\n\t"
>         : "=m"(r)
>         : "m"(a), "m"(b), "m"(shift), "m"(ctrl), "m"(bit63));
>     return r + UINT64_C(0x8000000000000000);
> }
>
> A few notes on the code fildll loads a signed 64 bit int so you need to
> subtract 2^63 and load then add 2.0^63 to make it an unsigned load and
> the
> reverse on the store. ffree is needed because after 6 calls the stack
> is full
> or something and gives bad data. Lastly I have never done x87 before
> today.
> So there's probably a better way to do it. Also this stack thing I
> don't get
> it besides: there are 8 registers, it's a cyclical stack, you push and
> pop,
> it breaks if you push to much instead of overwriting oldest, there's a
> notion
> of free that isn't a pop but works like it?, and you can change the stack
> pointer by +-1 and somehow doesn't make it implode.


FSCALE is nice, but it is a microcoded instruction which has something
like 40 cycles of latency on Pentium III/M, which sadly makes it much
worse than FMUL. For your 64-bit (instead of 63-bit) multiplication to
work, we need to adjust the rounding mode towards minus infinity,
instead of truncating. Here's a slightly tweaked version that (I hope)
is correct:

    static inline uint64_t fp_mul64_hi(uint64_t a, uint64_t b)
    {
        static const double twom64 =
5.4210108624275221700372640043497085571289062500000E-20;
        static const uint16_t ctrl = 0x077f; // 64-bit mantissa, round
to -infinity
        uint64_t high, sum = 0;
        __asm__ __volatile__
        (
            "fldcw   %4\n\t" // setuprounding mode
            "fildll  %1\n\t"
            "fildll  %2\n\t"
            "fmulp     \n\t"
            "fmull   %3\n\t"
            "fistpll %0\n\t"
            : "=m" (high)
            : "m" (a), "m" (b), "m" (twom64), "m" (ctrl)
        );
    #if defined(CONSTANT_TIME)// Computation of sum can run interleaved
with FP
        sum = (a & -(b >> 63)) + (b & -(a >> 63));
    #else
        if(a & (1ULL << 63)) sum += b;
        if(b & (1ULL << 63)) sum += a;
    #endif
        return high + sum;
    }



Content of type "text/html" skipped