linux-kernel - Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20120925114425.15269.qmail@science.horizon.com>
Date:	25 Sep 2012 07:44:25 -0400
From:	"George Spelvin" <linux@...izon.com>
To:	linux@...izon.com, mpn@...gle.com, vda.linux@...glemail.com
Cc:	hughd@...gle.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8

Thanks to Denys Vlasenko for sending me his benchmarking code.

I went and hacked on it to ransomize the numbers being converted more,
since repeatedly converting the same number underestimates the number
of branch mispredictions.

Then I tried computing the number of digits beforehand, as mentioned
in my earlier message, and it came out slightly slower.  The code is:

static noinline_for_stack
char *put_dec_trunc8(char *buf, unsigned n)
{
        static unsigned const pow10[9] = { 0, 10, 100, 1000, 10000, 100000,
					 1000000, 10000000, 100000000 };
	unsigned digits = (19 * fls(n) + 6) >> 6;  /* Valid for < 44 bits */
	unsigned t;

	/*
	 * "Digits" is an estimate, always exact or high by 1, of the
	 * number of decimal digits in r.  It is offset by 1, so digits=0
	 * means 1 digit.  Given that the largest value ever passed to
	 * this function has 27 bits, The largest value is (19 * 27 +
	 * 6) >> 6 = 8, meaning 9 digits.  However, that will always be
	 * rounded down because r is less than pow10[8] = 10^8.
	 *
	 * We could use smaller multipliers in cases 4 through 6, but that
	 * would require a shift by less than 32, which is awkward on a
	 * 32-bit processor and wastes more time than the larger multiply.
	 */
	switch (digits - (n < pow10[digits])) {
	default:
		__builtin_unreachable();
	case 7:		/* q <= 99999999 */
		t = n + '0';
		n  = (n * (uint64_t)0x1999999a) >> 32;
		*buf++ = t - 10*n;
	case 6:		/* q <= 9999999 */
		t = n + '0';
		n  = (n * (uint64_t)0x1999999a) >> 32;
		*buf++ = t - 10*n;
	case 5:		/* t <= 999999 */
		t = n + '0';
		n  = (n * (uint64_t)0x1999999a) >> 32;
		*buf++ = t - 10*n;
	case 4:		/* t <= 99999 */
		t = n + '0';
		n  = (n * (uint64_t)0x1999999a) >> 32;
		*buf++ = t - 10*n;
	case 3:		/* t <= 9999 */
		t = n + '0';
		n  = (n * 0x199a) >> 16;
		*buf++ = t - 10*n;
	case 2:		/* t <= 999 */
		t = n + '0';
		n  = (n * 0xcd) >> 11;
		*buf++ = t - 10*n;
	case 1:		/* t <= 99 */
		t = n + '0';
		n  = (n * 0xcd) >> 11;
		*buf++ = t - 10*n;
	case 0:		/* t <= 9 */
		*buf++ = n + '0';
	}
	return buf;
}


But!  With more extensive refactoring of the number() code, computing
the number of digits at the beginning can eliminate the entire business
of formatting into tmp[] and copying backward.  We'll first compute the
number of digits, check for buffer overflow, insert the printf padding,
and then call the number-formatting code, passing the number of digits in.

It seems plausible that the resultant simplification will produce a
speedup.

I'm going to experiment with that.  Denys, since I'm playing in your sandbox,
do you have any violent objections to that?  (Assuming, of course, that it's
a net speed win and my surgery to function signatures isn't too hideous.)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/