Date:	Tue, 9 Aug 2011 10:28:50 -0500
From:	"Bob Pearson" <rpearson@...temfabricworks.com>
To:	"'George Spelvin'" <linux@...izon.com>,
	<akpm@...ux-foundation.org>, <fzago@...temfabricworks.com>,
	<joakim.tjernlund@...nsmode.se>, <linux-kernel@...r.kernel.org>
Subject: RE: [patch v3 6/7] crc32: add-slicing-by-8.diff



> -----Original Message-----
> From: George Spelvin [mailto:linux@...izon.com]
> Sent: Tuesday, August 09, 2011 6:21 AM
> To: akpm@...ux-foundation.org; fzago@...temfabricworks.com;
> joakim.tjernlund@...nsmode.se; linux-kernel@...r.kernel.org;
> linux@...izon.com; rpearson@...temfabricworks.com
> Subject: Re: [patch v3 6/7] crc32: add-slicing-by-8.diff
> 
> While writing up some documentation of this algorithm, I came up with
> a potential speedup.  Or, at least, realized why slicing by more than
> 4 is so much faster than slicing by 4 or less.
> 
> Note that the inner loop of the algorithm is as follows:
> 
> +#  define DO_CRC8a (tab[7][(q) & 255] ^ \
> +		tab[6][(q >> 8) & 255] ^ \
> +		tab[5][(q >> 16) & 255] ^ \
> +		tab[4][(q >> 24) & 255])
> +#  define DO_CRC8b (tab[3][(q) & 255] ^ \
> +		tab[2][(q >> 8) & 255] ^ \
> +		tab[1][(q >> 16) & 255] ^ \
> +		tab[0][(q >> 24) & 255])
> 
> +	for (--b; middle_len; --middle_len) {
> +		u32 q;
> +		q = crc ^ *++b;
> +		crc = DO_CRC8a;
> +		q = *++b;
> +		crc ^= DO_CRC8b;
>  	}
> 
> Note the data dependencies: DO_CRC8a depends on the
> previous crc, which depends on the previous DO_CRC8b.
> But DO_CRC8b does not depend on anything except the
> input data at *++b.

I think you've got it. That is exactly why it works. The hardware already
seems to be pretty good at finding that parallelism on its own, since this
code runs at almost 2X the speed of the slicing-by-4 (CRC4) loop.
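
In case anyone wants to poke at this outside the kernel, below is a minimal
standalone sketch of the slicing-by-8 scheme (my own illustration, not the
kernel code). It assumes the reflected CRC-32 polynomial 0xEDB88320 and
little-endian 32-bit loads, the tab[]/crc32_* names are made up for the
example, and it cross-checks the sliced loop against a plain byte-at-a-time
loop:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t tab[8][256];

static void crc32_init_tables(void)
{
	uint32_t i, c;
	int k, t;

	/* tab[0] is the usual byte-at-a-time table for poly 0xEDB88320. */
	for (i = 0; i < 256; i++) {
		c = i;
		for (k = 0; k < 8; k++)
			c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
		tab[0][i] = c;
	}
	/* tab[t][i] is tab[t-1][i] advanced over one extra zero byte. */
	for (i = 0; i < 256; i++)
		for (t = 1; t < 8; t++)
			tab[t][i] = (tab[t - 1][i] >> 8) ^
				    tab[0][tab[t - 1][i] & 255];
}

/* Reference: one table lookup per byte. */
static uint32_t crc32_byte(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--)
		crc = (crc >> 8) ^ tab[0][(crc ^ *p++) & 255];
	return crc;
}

/* Slicing-by-8: eight lookups per 8 bytes, split into the same two
 * independent halves as DO_CRC8a/DO_CRC8b above. */
static uint32_t crc32_slice8(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len >= 8) {
		uint32_t q1, q2;

		memcpy(&q1, p, 4);	/* little-endian load assumed */
		memcpy(&q2, p + 4, 4);
		q1 ^= crc;
		crc = tab[7][q1 & 255] ^ tab[6][(q1 >> 8) & 255] ^
		      tab[5][(q1 >> 16) & 255] ^ tab[4][(q1 >> 24) & 255] ^
		      tab[3][q2 & 255] ^ tab[2][(q2 >> 8) & 255] ^
		      tab[1][(q2 >> 16) & 255] ^ tab[0][(q2 >> 24) & 255];
		p += 8;
		len -= 8;
	}
	return crc32_byte(crc, p, len);	/* tail bytes */
}

int main(void)
{
	uint8_t buf[1024];
	size_t i;

	crc32_init_tables();
	for (i = 0; i < sizeof(buf); i++)
		buf[i] = (uint8_t)(i * 251 + 7);

	/* Both lines must agree; the usual API also pre/post-inverts. */
	printf("byte-at-a-time: %08" PRIx32 "\n",
	       crc32_byte(0xFFFFFFFFu, buf, sizeof(buf)) ^ 0xFFFFFFFFu);
	printf("slicing-by-8:   %08" PRIx32 "\n",
	       crc32_slice8(0xFFFFFFFFu, buf, sizeof(buf)) ^ 0xFFFFFFFFu);
	return 0;
}

Build with something like "gcc -O2 -o crc8x crc8x.c"; the two printed
values should match.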

> 
> It would increase parallelism to schedule DO_CRC8b before DO_CRC8a,
> to start those loads before the previous crc value is available.
> 
> Maybe the compiler and/or processor can find this parallelism already,
> but if not, it might be useful to try reordering it:
> 
> #  define DO_CRC8a(x) (tab[7][(x) & 255] ^ \
> 		tab[6][((x) >> 8) & 255] ^ \
> 		tab[5][((x) >> 16) & 255] ^ \
> 		tab[4][((x) >> 24) & 255])
> #  define DO_CRC8b(x) (tab[3][(x) & 255] ^ \
> 		tab[2][((x) >> 8) & 255] ^ \
> 		tab[1][((x) >> 16) & 255] ^ \
> 		tab[0][((x) >> 24) & 255])
> 
> 	for ( ; middle_len; --middle_len, b += 2) {
> 		u32 q = DO_CRC8b(b[1]);
> 		crc ^= b[0];
> 		crc = q ^ DO_CRC8a(crc);
> 	}
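
FWIW, the reordering can also be written against the standalone sketch
earlier in this message (reusing its hypothetical tab[] and crc32_byte(),
and the same little-endian assumption). The only change is that the
crc-independent half is issued first; the result has to stay bit-identical,
so whether the source-level reordering actually helps would need to be
measured:

static uint32_t crc32_slice8_reordered(uint32_t crc, const uint8_t *p,
				       size_t len)
{
	while (len >= 8) {
		uint32_t q1, q2, hi;

		memcpy(&q1, p, 4);	/* little-endian load assumed */
		memcpy(&q2, p + 4, 4);
		/* Kick off the crc-independent half first ... */
		hi = tab[3][q2 & 255] ^ tab[2][(q2 >> 8) & 255] ^
		     tab[1][(q2 >> 16) & 255] ^ tab[0][(q2 >> 24) & 255];
		/* ... then the half that needs the previous crc. */
		q1 ^= crc;
		crc = hi ^ tab[7][q1 & 255] ^ tab[6][(q1 >> 8) & 255] ^
		      tab[5][(q1 >> 16) & 255] ^ tab[4][(q1 >> 24) & 255];
		p += 8;
		len -= 8;
	}
	return crc32_byte(crc, p, len);
}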

