Date:	Tue, 9 Aug 2011 10:28:50 -0500
From:	"Bob Pearson" <rpearson@...temfabricworks.com>
To:	"'George Spelvin'" <linux@...izon.com>,
	<akpm@...ux-foundation.org>, <fzago@...temfabricworks.com>,
	<joakim.tjernlund@...nsmode.se>, <linux-kernel@...r.kernel.org>
Subject: RE: [patch v3 6/7] crc32: add-slicing-by-8.diff



> -----Original Message-----
> From: George Spelvin [mailto:linux@...izon.com]
> Sent: Tuesday, August 09, 2011 6:21 AM
> To: akpm@...ux-foundation.org; fzago@...temfabricworks.com;
> joakim.tjernlund@...nsmode.se; linux-kernel@...r.kernel.org;
> linux@...izon.com; rpearson@...temfabricworks.com
> Subject: Re: [patch v3 6/7] crc32: add-slicing-by-8.diff
> 
> While writing up some documentation of this algorithm, I came up with
> a potential speedup.  Or, at least, realized why slicing by more than
> 4 is so much faster than slicing by 4 or less.
> 
> Note that the inner loop of the algorithm is as follows:
> 
> +#  define DO_CRC8a (tab[7][(q) & 255] ^ \
> +		tab[6][(q >> 8) & 255] ^ \
> +		tab[5][(q >> 16) & 255] ^ \
> +		tab[4][(q >> 24) & 255])
> +#  define DO_CRC8b (tab[3][(q) & 255] ^ \
> +		tab[2][(q >> 8) & 255] ^ \
> +		tab[1][(q >> 16) & 255] ^ \
> +		tab[0][(q >> 24) & 255])
> 
> +	for (--b; middle_len; --middle_len) {
> +		u32 q;
> +		q = crc ^ *++b;
> +		crc = DO_CRC8a;
> +		q = *++b;
> +		crc ^= DO_CRC8b;
>  	}
> 
> Note the data dependencies: DO_CRC8a depends on the
> previous crc, which depends on the previous DO_CRC8b.
> But DO_CRC8b does not depend on anything except the
> input data at *++b.

I think you've got it. That is exactly why it works. The hardware already
seems to be pretty good at finding that parallelism on its own, since this
code runs at almost 2X the speed of the slicing-by-4 (CRC4) loop.
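
In case anyone wants to poke at this outside the kernel, below is a minimal
standalone sketch of the slicing-by-8 scheme (my own illustration, not the
kernel code). It assumes the reflected CRC-32 polynomial 0xEDB88320 and
little-endian 32-bit loads, the tab[]/crc32_* names are made up for the
example, and it cross-checks the sliced loop against a plain byte-at-a-time
loop:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t tab[8][256];

static void crc32_init_tables(void)
{
	uint32_t i, c;
	int k, t;

	/* tab[0] is the usual byte-at-a-time table for poly 0xEDB88320. */
	for (i = 0; i < 256; i++) {
		c = i;
		for (k = 0; k < 8; k++)
			c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
		tab[0][i] = c;
	}
	/* tab[t][i] is tab[t-1][i] advanced over one extra zero byte. */
	for (i = 0; i < 256; i++)
		for (t = 1; t < 8; t++)
			tab[t][i] = (tab[t - 1][i] >> 8) ^
				    tab[0][tab[t - 1][i] & 255];
}

/* Reference: one table lookup per byte. */
static uint32_t crc32_byte(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--)
		crc = (crc >> 8) ^ tab[0][(crc ^ *p++) & 255];
	return crc;
}

/* Slicing-by-8: eight lookups per 8 bytes, split into the same two
 * independent halves as DO_CRC8a/DO_CRC8b above. */
static uint32_t crc32_slice8(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len >= 8) {
		uint32_t q1, q2;

		memcpy(&q1, p, 4);	/* little-endian load assumed */
		memcpy(&q2, p + 4, 4);
		q1 ^= crc;
		crc = tab[7][q1 & 255] ^ tab[6][(q1 >> 8) & 255] ^
		      tab[5][(q1 >> 16) & 255] ^ tab[4][(q1 >> 24) & 255] ^
		      tab[3][q2 & 255] ^ tab[2][(q2 >> 8) & 255] ^
		      tab[1][(q2 >> 16) & 255] ^ tab[0][(q2 >> 24) & 255];
		p += 8;
		len -= 8;
	}
	return crc32_byte(crc, p, len);	/* tail bytes */
}

int main(void)
{
	uint8_t buf[1024];
	size_t i;

	crc32_init_tables();
	for (i = 0; i < sizeof(buf); i++)
		buf[i] = (uint8_t)(i * 251 + 7);

	/* Both lines must agree; the usual API also pre/post-inverts. */
	printf("byte-at-a-time: %08" PRIx32 "\n",
	       crc32_byte(0xFFFFFFFFu, buf, sizeof(buf)) ^ 0xFFFFFFFFu);
	printf("slicing-by-8:   %08" PRIx32 "\n",
	       crc32_slice8(0xFFFFFFFFu, buf, sizeof(buf)) ^ 0xFFFFFFFFu);
	return 0;
}

Build with something like "gcc -O2 -o crc8x crc8x.c"; the two printed
values should match.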

> 
> It would increase parallelism to schedule DO_CRC8b before DO_CRC8a,
> to start those loads before the previous crc value is available.
> 
> Maybe the compiler and/or processor can find this parallelism already,
> but if not, it might be useful to try reordering it:
> 
> #  define DO_CRC8a(x) (tab[7][(x) & 255] ^ \
> 		tab[6][((x) >> 8) & 255] ^ \
> 		tab[5][((x) >> 16) & 255] ^ \
> 		tab[4][((x) >> 24) & 255])
> #  define DO_CRC8b(x) (tab[3][(x) & 255] ^ \
> 		tab[2][((x) >> 8) & 255] ^ \
> 		tab[1][((x) >> 16) & 255] ^ \
> 		tab[0][((x) >> 24) & 255])
> 
> 	for ( ; middle_len; --middle_len, b += 2) {
> 		u32 q = DO_CRC8b(b[1]);
> 		crc ^= b[0];
> 		crc = q ^ DO_CRC8a(crc);
> 	}
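
FWIW, the reordering can also be written against the standalone sketch
earlier in this message (reusing its hypothetical tab[] and crc32_byte(),
and the same little-endian assumption). The only change is that the
crc-independent half is issued first; the result has to stay bit-identical,
so whether the source-level reordering actually helps would need to be
measured:

static uint32_t crc32_slice8_reordered(uint32_t crc, const uint8_t *p,
				       size_t len)
{
	while (len >= 8) {
		uint32_t q1, q2, hi;

		memcpy(&q1, p, 4);	/* little-endian load assumed */
		memcpy(&q2, p + 4, 4);
		/* Kick off the crc-independent half first ... */
		hi = tab[3][q2 & 255] ^ tab[2][(q2 >> 8) & 255] ^
		     tab[1][(q2 >> 16) & 255] ^ tab[0][(q2 >> 24) & 255];
		/* ... then the half that needs the previous crc. */
		q1 ^= crc;
		crc = hi ^ tab[7][q1 & 255] ^ tab[6][(q1 >> 8) & 255] ^
		      tab[5][(q1 >> 16) & 255] ^ tab[4][(q1 >> 24) & 255];
		p += 8;
		len -= 8;
	}
	return crc32_byte(crc, p, len);
}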

