Date: Thu, 3 Mar 2016 16:12:16 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Tom Herbert' <tom@...bertland.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC: "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"hpa@...or.com" <hpa@...or.com>, "x86@...nel.org" <x86@...nel.org>,
"kernel-team@...com" <kernel-team@...com>
Subject: RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64
From: Tom Herbert
> Sent: 02 March 2016 22:19
...
> + /* Main loop using 64byte blocks */
> + for (; len > 64; len -= 64, buff += 64) {
> + asm("addq 0*8(%[src]),%[res]\n\t"
> + "adcq 1*8(%[src]),%[res]\n\t"
> + "adcq 2*8(%[src]),%[res]\n\t"
> + "adcq 3*8(%[src]),%[res]\n\t"
> + "adcq 4*8(%[src]),%[res]\n\t"
> + "adcq 5*8(%[src]),%[res]\n\t"
> + "adcq 6*8(%[src]),%[res]\n\t"
> + "adcq 7*8(%[src]),%[res]\n\t"
> + "adcq $0,%[res]"
> + : [res] "=r" (result)
> + : [src] "r" (buff),
> + "[res]" (result));
Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
without any unrolling?
...
> + /* Sum over any remaining bytes (< 8 of them) */
> + if (len & 0x7) {
> + unsigned long val;
> + /*
> + * Since "len" is > 8 here we backtrack in the buffer to load
> + * the outstanding bytes into the low order bytes of a quad and
> + * then shift to extract the relevant bytes. By doing this we
> + * avoid additional calls to load_unaligned_zeropad.
That comment is wrong. Maybe:
* Read the last 8 bytes of the buffer then shift to extract
* the required bytes.
* This is safe because the original length was > 8, so the read
* cannot go beyond the end of the valid data.
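A minimal sketch of the tail read suggested above (hypothetical helper; little-endian x86_64 assumed, with memcpy standing in for the kernel's unaligned-load helpers): because the total length was > 8, the last 8 bytes of the buffer are all valid, and a shift discards the bytes the quad-word loop already summed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Extract the trailing (len & 7) bytes of the buffer without
 * reading past buff + len.  Requires len > 8 and (len & 7) != 0. */
static uint64_t tail_bytes(const unsigned char *buff, size_t len)
{
	size_t rem = len & 7;	/* outstanding bytes, 1..7 */
	uint64_t val;

	/* 8-byte load ending exactly at buff + len: overlaps bytes
	 * already summed, but never touches invalid memory. */
	memcpy(&val, buff + len - 8, 8);

	/* Shift out the low (8 - rem) bytes already covered by the
	 * quad-word loop (little-endian byte order assumed). */
	return val >> (8 * (8 - rem));
}
```

This avoids both a load_unaligned_zeropad-style fixup and any read beyond the end of the data.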
David