linux-kernel - Re: [RFC] lib/vsprintf.c: Even faster decimal conversion

Message-ID: <87r3sm2pvk.fsf@rasmusvillemoes.dk>
Date:	Wed, 18 Mar 2015 18:10:23 +0100
From:	Rasmus Villemoes <linux@...musvillemoes.dk>
To:	Denys Vlasenko <vda.linux@...glemail.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	"Peter Zijlstra \(Intel\)" <peterz@...radead.org>,
	Tejun Heo <tj@...nel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [RFC] lib/vsprintf.c: Even faster decimal conversion

On Wed, Mar 18 2015, Denys Vlasenko <vda.linux@...glemail.com> wrote:


> Your code does four 16-bit stores.
> The version below does two 32-bit ones instead,
> and it is also marginally smaller.
>
> char *put_dec_full8(char *buf, unsigned r)
> {
>         unsigned q;
>         u32 v;
>
>         /* 0 <= r < 10^8 */
>         q = (r * (u64)0x28f5c29) >> 32;
>         v = (u32)decpair[r - 100*q] << 16;
>
>         /* 0 <= q < 10^6 */
>         r = (q * (u64)0x28f5c29) >> 32;
>         v = v | decpair[q - 100*r];
>         ((u32*)buf)[0] = v;
>
>         /* 0 <= r < 10^4 */
>         q = (r * 0x147b) >> 19;
>         v = (u32)decpair[r - 100*q] << 16;
>
>         /* 0 <= q < 100 */
>         v = v | decpair[q];
>         ((u32*)buf)[1] = v;
>
>         return buf + 8;
> }
>
> It may be faster not only because of having fewer stores,
> but because on x86, this code (moving 16-bit halves):
>
>         movw    decpair(%ebx,%ebx), %dx
>         movw    %dx, 4(%eax)
>         movw    decpair(%ecx,%ecx), %dx
>         movw    %dx, 6(%eax)
>
> suffers from register merge stall when 16-bit value
> is read into lower part of %edx. 32-bit code
> has no such stalls:
>
>         movzwl  decpair(%ebx,%ebx), %edx
>         sall    $16, %edx
>         movzwl  decpair(%ecx,%ecx), %ecx
>         orl     %ecx, %edx
>         movl    %edx, 4(%eax)
>

[On little-endian, I'm pretty sure the <<16 should be applied to the
second and fourth decpair values, not the first and third.]
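
The byte-order remark can be checked with a small user-space sketch. It is not the kernel code: it assumes each decpair entry holds the low digit in its first byte in memory (the table itself is not shown in this message), and the helpers init_decpair(), host_is_little_endian(), join_pairs() and put_dec_full8_sketch() are hypothetical names introduced only for illustration. The stores go through memcpy() rather than a (u32 *) cast to sidestep alignment and aliasing questions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: entry i holds '0'+i%10 in its first byte in memory
 * and '0'+i/10 in its second, i.e. least significant digit first. */
static uint16_t decpair[100];

static void init_decpair(void)
{
	for (unsigned i = 0; i < 100; i++) {
		unsigned char b[2] = { (unsigned char)('0' + i % 10),
				       (unsigned char)('0' + i / 10) };
		memcpy(&decpair[i], b, sizeof(b));
	}
}

static int host_is_little_endian(void)
{
	const uint16_t probe = 1;
	unsigned char first;

	memcpy(&first, &probe, 1);
	return first == 1;
}

/* Combine two pairs so that 'first' lands at the lower address.
 * On little-endian the low half of the u32 is stored first, so the
 * shift goes on the second pair; on big-endian it is the reverse. */
static uint32_t join_pairs(uint16_t first, uint16_t second)
{
	if (host_is_little_endian())
		return (uint32_t)first | ((uint32_t)second << 16);
	return ((uint32_t)first << 16) | (uint32_t)second;
}

/* Writes the 8 digits of r in reverse order (least significant first). */
char *put_dec_full8_sketch(char *buf, unsigned r)
{
	unsigned q;
	uint16_t lo;
	uint32_t v;

	assert(r < 100000000u);			/* 0 <= r < 10^8 */
	q = (r * (uint64_t)0x28f5c29) >> 32;	/* q = r / 100 */
	lo = decpair[r - 100 * q];

	r = (q * (uint64_t)0x28f5c29) >> 32;	/* 0 <= q < 10^6 */
	v = join_pairs(lo, decpair[q - 100 * r]);
	memcpy(buf, &v, sizeof(v));

	q = (r * 0x147b) >> 19;			/* 0 <= r < 10^4 */
	v = join_pairs(decpair[r - 100 * q], decpair[q]);
	memcpy(buf + 4, &v, sizeof(v));

	return buf + 8;
}
```

With the shift on the first and third pairs instead, a little-endian host would swap the two pairs inside each 32-bit word.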

Thanks for the suggestion. However, I don't see any change in the size
of the generated code (gcc 4.7), and, at least on my Xeon machine,
converting both ULONG_MAX and uniformly random u64s becomes slightly
slower (54 vs 56 cycles and 61 vs 65 cycles, respectively).

Rasmus