linux-kernel - Re: [RFC] lib/vsprintf.c: Even faster decimal conversion

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAK1hOcNsVyHHFX4e+=-_2MrWFnjDedWcO0sS+2gatCeLrV20sw@mail.gmail.com>
Date:	Wed, 18 Mar 2015 01:50:35 +0100
From:	Denys Vlasenko <vda.linux@...glemail.com>
To:	Rasmus Villemoes <linux@...musvillemoes.dk>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	"Peter Zijlstra (Intel)" <peterz@...radead.org>,
	Tejun Heo <tj@...nel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [RFC] lib/vsprintf.c: Even faster decimal conversion

On Sat, Feb 21, 2015 at 12:51 AM, Rasmus Villemoes
<linux@...musvillemoes.dk> wrote:
> The most expensive part of decimal conversion is the divisions by 10
> (albeit done using reciprocal multiplication with appropriately chosen
> constants). I decided to see if one could eliminate around half of
> these multiplications by emitting two digits at a time, at the cost of
> a 200 byte lookup table, and it does indeed seem like there is
> something to be gained, especially on 64 bits.
...
...
> +char *put_dec_trunc8(char *buf, unsigned r)
>  {
> +       unsigned q;
> +
> +       /* 1 <= r < 10^8 */
> +       if (r < 100)
> +               goto out_r;
> +
> +       /* 100 <= r < 10^8 */
> +       q = (r * (u64)0x28f5c29) >> 32;
> +       *((u16 *)buf) = decpair[r - 100*q];
> +       buf += 2;
> +
> +       /* 1 <= q < 10^6 */
> +       if (q < 100)
> +               goto out_q;
> +
> +       /*  100 <= q < 10^6 */
> +       r = (q * (u64)0x28f5c29) >> 32;
> +       *((u16 *)buf) = decpair[q - 100*r];
> +       buf += 2;
> +
> +       /* 1 <= r < 10^4 */
> +       if (r < 100)
> +               goto out_r;
> +
> +       /* 100 <= r < 10^4 */
> +       q = (r * 0x147b) >> 19;
> +       *((u16 *)buf) = decpair[r - 100*q];
> +       buf += 2;
> +out_q:
> +       /* 1 <= q < 100 */
> +       r = q;
> +out_r:
> +       /* 1 <= r < 100 */
> +       *((u16 *)buf) = decpair[r];
> +       buf += 2;
> +       if (buf[-1] == '0')
> +               buf--;
>         return buf;
>  }

Your code does four 16-bit stores.
The version below does two 32-bit ones instead,
and it is also marginally smaller.

char *put_dec_full8(char *buf, unsigned r)
{
        unsigned q;
        u32 v;

        /* 0 <= r < 10^8 */
        q = (r * (u64)0x28f5c29) >> 32;
        v = (u32)decpair[r - 100*q] << 16;

        /* 0 <= q < 10^6 */
        r = (q * (u64)0x28f5c29) >> 32;
        v = v | decpair[q - 100*r];
        ((u32*)buf)[0] = v;

        /* 0 <= r < 10^4 */
        q = (r * 0x147b) >> 19;
        v = (u32)decpair[r - 100*q] << 16;

        /* 0 <= q < 100 */
        v = v | decpair[q];
        ((u32*)buf)[1] = v;

        return buf + 8;
}

It may be faster not only because of having fewer stores,
but because on x86, this code (moving 16-bit halves):

        movw    decpair(%ebx,%ebx), %dx
        movw    %dx, 4(%eax)
        movw    decpair(%ecx,%ecx), %dx
        movw    %dx, 6(%eax)

suffers from register merge stall when 16-bit value
is read into lower part of %edx. 32-bit code
has no such stalls:

        movzwl  decpair(%ebx,%ebx), %edx
        sall    $16, %edx
        movzwl  decpair(%ecx,%ecx), %ecx
        orl     %ecx, %edx
        movl    %edx, 4(%eax)


Before you ask:
I didn't manage to extend this trick to a code
with one 64-bit store which is not larger.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/