linux-kernel - Re: [patch V3] lib: GCD: add binary GCD algorithm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160428191551.17438.qmail@ns.horizon.com>
Date:	28 Apr 2016 15:15:51 -0400
From:	"George Spelvin" <linux@...izon.com>
To:	dalias@...c.org, geert@...ux-m68k.org
Cc:	akpm@...ux-foundation.org, davem@...emloft.net, deller@....de,
	ink@...assic.park.msu.ru, james.hogan@...tec.com,
	jejb@...isc-linux.org, jonas@...thpole.se, lennox.wu@...il.com,
	lftan@...era.com, linux-alpha@...r.kernel.org,
	linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
	linux-m68k@...ts.linux-m68k.org, linux-metag@...r.kernel.org,
	linux-mips@...ux-mips.org, linux-parisc@...r.kernel.org,
	linux-sh@...r.kernel.org, linux@....linux.org.uk,
	linux@...izon.com, linux@...ts.openrisc.net, liqin.linux@...il.com,
	mattst88@...il.com, monstr@...str.eu,
	nios2-dev@...ts.rocketboards.org, peterz@...radead.org,
	ralf@...ux-mips.org, rth@...ddle.net, sparclinux@...r.kernel.org,
	uclinux-h8-devel@...ts.sourceforge.jp, ysato@...rs.sourceforge.jp,
	zengzhaoxiu@....com, zhaoxiu.zeng@...il.com
Subject: Re: [patch V3] lib: GCD: add binary GCD algorithm

> How does a CPU lack an efficient ffs/ctz anyway? There are all sorts
> of ways to implement it without a native insn, some of which are
> almost or just as fast as the native insn on cpus that have the
> latter. On anything with a fast multiply, the de Bruijn sequence
> approach is near-optimal, and otherwise one of the binary-search type
> approaches (possibly branchless) can be used. If the compiler doesn't
> generate an appropriate one for __builtin_ctz, that's arguably a
> compiler bug.

What's wanted here is something faster than any of those.
Yes, there's a simple constant-time branch-free implementation:

unsigned inline __attribute__((const))
hweight32(uint32_t x)
{
	x -= (x >> 1) & 0x55555555;
	x  = ((x >> 2) & 0x33333333) + (x & 0x33333333);
	x += x >> 4;
	x &= 0x0f0f0f0f;
	x += x >> 8;
	x += x >> 16;
	return x & 63;
}

unsigned inline __attribute__((const))
__ffs32(uint32_t x)
{
	return hweight(~x & (x-1));
}

but if you work it through, that's about 19 instructions; a few more on
platforms without 32-bit immediates.  The shift itself makes an even 20,
and there are a lot of sequential dependencies (I count a 17-op chain
including the shift) limiting execution time.

The de Bruijn hack reduces the length but adds a memory access for
the table lookup.  (http://supertech.csail.mit.edu/papers/debruijn.pdf)

In the GCD code, the number to normalize is basically random, so the
normalization loop shifts an average of 1 bit.  One bit half the time,
a second bit 1/4 of the time, etc.

(The posted code in the FAST_FFS case omits one guaranteed shift at the
end of the loop because the normalization code is constant-time.)

So "fast __ffs" basically means faster than *one* iteration of
"while (!(x & 1)) x >>= 1;".

In this case "fast" means cheaper than *one* unpredictable branch, which
is a very small handful of instructions.