Date: Tue, 04 Mar 2014 04:04:30 +0000
From: Samuel Neves <sneves@....uc.pt>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] wider integer multiply on 32-bit x86

On 04-03-2014 02:36, Solar Designer wrote:
> On Mon, Mar 03, 2014 at 06:23:32PM -0800, Andy Lutomirski wrote:
>> On Mon, Mar 3, 2014 at 6:13 PM, Solar Designer <solar@...nwall.com> wrote:
>>> I think 4 instructions, including the loads and stores, for a 63x63->63
>>> multiply is rather good.  Without this trick, it'd take 4 _multiplies_
>>> to implement the equivalent via 32x32->32 (or perhaps 3 multiplies if we
>>> also use the 32x32->64).  Some bigint library could use this trick,
>>> perhaps for some nice speedup on those older CPUs/builds (does any use
>>> it already?)
>> I think that Poly1305 and related things use a similar trick, at least
>> on some architectures.
> I thought Poly1305 used IEEE double, so up to 53-bit only, and it's not
> a bigint library.  But yes, it's similar.
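
For what it's worth, the three-multiply variant via 32x32->64 that Solar
mentions would look something like this (my sketch, not from the thread;
the point is that the high 32x32 product never contributes to the low
63 bits, so it can be dropped):

```c
#include <stdint.h>

/* Low 63 bits of a 63x63 product using three 32x32->64 multiplies.
 * ah, bh < 2^31 since a, b < 2^63, so the cross terms cannot overflow. */
static uint64_t mul63_lo(uint64_t a, uint64_t b)
{
    uint64_t al = a & 0xFFFFFFFFULL, ah = a >> 32;
    uint64_t bl = b & 0xFFFFFFFFULL, bh = b >> 32;
    uint64_t lo  = al * bl;
    uint64_t mid = ah * bl + al * bh; /* ah*bh only affects bit 64 and up */
    return (lo + (mid << 32)) & 0x7FFFFFFFFFFFFFFFULL;
}
```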

Poly1305 used the full x86 64-bit mantissa [1, Section 4]. So did
curve25519 [2]. According to Agner Fog, the floating-point and integer
multipliers in the Pentium III share the same circuitry (and the
floating-point multiplication has one extra cycle of latency), so the
only advantage here would appear to be register usage. Register renaming
does help there.

By the way, in the 63x63 -> 63 case the floating-point unit would
actually be computing the most-significant bits of the product, not the
least-significant. This could be used to compute the xor of both product
halves relatively quickly: the integer multiplier computes the lower
part while the floating-point multiplier computes the higher part. Like
so (warning: not thoroughly tested, but seems to work):

    // Compute 63x63 product, xor high half with low half (needs <stdint.h>)
    static inline uint64_t fp_mul_hilo(uint64_t a, uint64_t b)
    {
        static const long double twom63 = 1.08420217248550443400745280087E-19L; // 2^-63
        static const uint16_t ctrl = 0x1f7f; // 64-bit mantissa, round to 0
        uint64_t high;
        __asm__ __volatile__
        (
            "fldcw   %4\n\t" // Only needed for setup, really
            "fildll  %1\n\t"
            "fildll  %2\n\t"
            "fmulp     \n\t"
            "fldt    %3\n\t"
            "fmulp     \n\t"
            "fistpll %0\n\t"
            : "=m" (high)
            : "m" (a), "m" (b), "m" (twom63), "m" (ctrl)
        );
        return high ^ ((a * b) & 0x7FFFFFFFFFFFFFFFULL);
    }
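
On compilers with a 128-bit integer type, the same hi-xor-lo value can
be computed portably for cross-checking (my addition for testing only;
obviously not an option on the 32-bit chips being discussed):

```c
#include <stdint.h>

/* Reference for fp_mul_hilo(): full 126-bit product via unsigned
 * __int128, then xor of the two 63-bit halves. Assumes a, b < 2^63. */
static inline uint64_t ref_mul_hilo(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    return (uint64_t)(p >> 63) ^ ((uint64_t)p & 0x7FFFFFFFFFFFFFFFULL);
}
```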

To be honest, though, I don't think using floating-point is worth it.
The only chip where floating-point multiplication convincingly beat
integer multiplication was the AMD K7, which could dispatch a
floating-point multiply every cycle but an integer one only once every
three cycles.

[1] http://cr.yp.to/mac/poly1305-20050329.pdf
[2] http://cr.yp.to/ecdh/curve25519-20060209.pdf
