linux-kernel - RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86


Message-ID: <063D6719AE5E284EB5DD2968C1650D6D1CCDD01B@AcuExch.aculab.com>
Date:	Wed, 10 Feb 2016 15:18:14 +0000
From:	David Laight <David.Laight@...LAB.COM>
To:	'George Spelvin' <linux@...izon.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"tom@...bertland.com" <tom@...bertland.com>
CC:	"mingo@...nel.org" <mingo@...nel.org>
Subject: RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10:	adcq	0(%rdi,%rcx,8),%rax
> > 	inc	%rcx
> > 	jnz	10b
> > That loop looks like it will have no overhead on recent cpu.
> 
> Well, it should execute at 1 instruction/cycle.

I presume you do mean 1 adc/cycle.
If it doesn't, unrolling once might help.
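
For reference, the arithmetic that the single-adc loop performs can be modelled in C. This is only a sketch: the function name csum_words_adc and its signature are made up for illustration and are not the kernel's csum_partial. It mirrors the negative index that counts up to zero and the carry that feeds into the next add, but a compiler is not guaranteed to turn it into an adc chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's csum_partial): sum nwords
 * 64-bit words into 'sum' with end-around carry, modelling
 *     10: adcq 0(%rdi,%rcx,8),%rax ; inc %rcx ; jnz 10b
 */
static uint64_t csum_words_adc(const uint64_t *buf, size_t nwords, uint64_t sum)
{
	assert(buf != NULL || nwords == 0);

	const uint64_t *end = buf + nwords;	/* plays the role of %rdi */
	unsigned carry = 0;			/* plays the role of CF */

	/* i runs from -nwords up to 0, like %rcx in the loop above */
	for (ptrdiff_t i = -(ptrdiff_t)nwords; i != 0; i++) {
		uint64_t t = sum + end[i];
		unsigned c1 = t < sum;		/* carry out of sum + word */
		uint64_t r = t + carry;
		unsigned c2 = r < t;		/* carry out of adding old CF */

		sum = r;
		carry = c1 | c2;		/* at most one of them is set */
	}

	/* Fold the last carry back in; that add can itself wrap once. */
	sum += carry;
	if (sum < carry)
		sum++;
	return sum;
}
```

Calling csum_words_adc(buf, n, 0) on an 8-byte aligned buffer of n words gives the 64-bit ones' complement sum; folding that down to 16 bits is left out here.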

> (No, a scaled offset doesn't take extra time.)
Maybe I'm remembering the 386 book.

> To break that requires ADCX/ADOX:
> 
> 10:	adcxq	0(%rdi,%rcx),%rax
> 	adoxq	8(%rdi,%rcx),%rdx
>  	leaq	16(%rcx),%rcx
> 	jrcxz	11f
>  	jmp	10b
> 11:

Getting 2 adc/cycle probably does require a little unrolling.
With luck the adcxq, adoxq and leaq will execute together.
The jrcxz takes two clocks, so the loop definitely needs a second adoxq/adcxq pair.

Experiments would be needed to confirm guesses though.
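
As a starting point for such experiments, here is a C model of the two-chain idea. It is only a sketch under assumptions: the names add_carry, add_eac and csum_words_dual are invented for illustration, the word count is assumed to be even, and plain C like this does not force the compiler to emit ADCX/ADOX. It only shows that two independent carry chains, folded together at the end, give the same ones' complement sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a + b + *carry, leaving the carry out in *carry (models adcx or adox) */
static uint64_t add_carry(uint64_t a, uint64_t b, unsigned *carry)
{
	uint64_t t = a + b;
	unsigned c1 = t < a;
	uint64_t r = t + *carry;
	unsigned c2 = r < t;

	*carry = c1 | c2;
	return r;
}

/* a + b with end-around carry; the +1 cannot wrap a second time */
static uint64_t add_eac(uint64_t a, uint64_t b)
{
	uint64_t t = a + b;

	return t + (t < a);
}

/*
 * Hypothetical helper: even-indexed words go through one chain
 * (like adcx with CF), odd-indexed words through another (like
 * adox with OF). nwords is assumed to be even.
 */
static uint64_t csum_words_dual(const uint64_t *buf, size_t nwords)
{
	assert(nwords % 2 == 0);

	uint64_t even = 0, odd = 0;
	unsigned cf = 0, of = 0;

	for (size_t i = 0; i < nwords; i += 2) {
		even = add_carry(even, buf[i], &cf);
		odd = add_carry(odd, buf[i + 1], &of);
	}

	/* Fold each pending carry into its own sum, then merge the chains. */
	return add_eac(add_eac(even, cf), add_eac(odd, of));
}
```

The loop body has no dependency between the two accumulators, which is the property the adcxq/adoxq pair is meant to exploit; whether that yields 2 adc/cycle on real hardware still has to be measured.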

	David