netdev - RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86

Message-ID: <063D6719AE5E284EB5DD2968C1650D6D1CCD63C9@AcuExch.aculab.com>
Date:	Thu, 4 Feb 2016 11:08:45 +0000
From:	David Laight <David.Laight@...LAB.COM>
To:	'Tom Herbert' <tom@...bertland.com>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC:	"tglx@...utronix.de" <tglx@...utronix.de>,
	"mingo@...hat.com" <mingo@...hat.com>,
	"hpa@...or.com" <hpa@...or.com>, "x86@...nel.org" <x86@...nel.org>,
	"kernel-team@...com" <kernel-team@...com>
Subject: RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

From: Tom Herbert
> Sent: 03 February 2016 19:19
...
> +	/* Main loop */
> +50:	adcq	0*8(%rdi),%rax
> +	adcq	1*8(%rdi),%rax
> +	adcq	2*8(%rdi),%rax
> +	adcq	3*8(%rdi),%rax
> +	adcq	4*8(%rdi),%rax
> +	adcq	5*8(%rdi),%rax
> +	adcq	6*8(%rdi),%rax
> +	adcq	7*8(%rdi),%rax
> +	adcq	8*8(%rdi),%rax
> +	adcq	9*8(%rdi),%rax
> +	adcq	10*8(%rdi),%rax
> +	adcq	11*8(%rdi),%rax
> +	adcq	12*8(%rdi),%rax
> +	adcq	13*8(%rdi),%rax
> +	adcq	14*8(%rdi),%rax
> +	adcq	15*8(%rdi),%rax
> +	lea	128(%rdi), %rdi
> +	loop	50b

I'd need convincing that unrolling the loop like that gives any significant gain.
The 'adcq' instructions form a dependency chain on the carry flag, so each one
has to wait for the previous one to complete (these delays may be more
significant than the memory reads from the L1 cache).

I also don't remember the 'loop' instruction being executed quickly (I might be wrong).
If 'loop' is fast then you will probably find that:

10:	adcq 0(%rdi),%rax
	lea  8(%rdi),%rdi
	loop 10b

is just as fast, since the three instructions could all be executed in parallel.
But I suspect that 'dec %rcx; jnz 10b' is actually better (and might execute as
a single micro-op).
'dec' leaves the carry flag itself alone, but IIRC 'adc' and 'dec' will both
have dependencies on the flags register so cannot execute together (which is a
shame here).

It is also possible that breaking the carry-chain dependency by doing 32-bit
adds (possibly after 64-bit reads) can be made to be faster.
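
As a rough illustration of that last idea (a sketch only, not the patch's code
and not measured), each 64-bit read can be split into two 32-bit halves that
are added into separate 64-bit accumulators. No add then needs the carry flag,
and the carries are folded back in once at the end. The names csum_sketch and
fold64 are made up for this example, and the length is assumed to be a
multiple of 8 bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a 64-bit accumulator down to 16 bits,
 * adding the carries back in (ones-complement style). */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Hypothetical sketch: sum a buffer in native byte order using 64-bit
 * reads and 32-bit adds. The two accumulators are independent, so there
 * is no carry-flag chain between consecutive adds. */
static uint16_t csum_sketch(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t lo = 0, hi = 0;

	assert(len % 8 == 0);	/* assumption: no head/tail handling here */

	for (; len >= 8; p += 8, len -= 8) {
		uint64_t w;

		memcpy(&w, p, sizeof(w));	/* 64-bit read, alignment-safe */
		lo += (uint32_t)w;		/* low 32 bits, cannot overflow soon */
		hi += w >> 32;			/* high 32 bits */
	}
	/* Each accumulator gains at most 2^32 - 1 per iteration, so the
	 * 64-bit totals cannot wrap for any realistic packet length. */
	return fold64(lo + hi);
}
```

Whether this beats the 'adcq' chain would need measuring on the CPUs of
interest; it trades the carry dependency for extra instructions per word.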

	David