lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Wed, 10 Feb 2016 15:18:14 +0000
From:	David Laight <David.Laight@...LAB.COM>
To:	'George Spelvin' <linux@...izon.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"tom@...bertland.com" <tom@...bertland.com>
CC:	"mingo@...nel.org" <mingo@...nel.org>
Subject: RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10:	adcq	0(%rdi,%rcx,8),%rax
> > 	inc	%rcx
> > 	jnz	10b
> > That loop looks like it will have no overhead on recent cpu.
> 
> Well, it should execute at 1 instruction/cycle.

I presume you do mean 1 adc/cycle.
If it doesn't unrolling once might help.

> (No, a scaled offset doesn't take extra time.)
Maybe I'm remembering the 386 book.

> To break that requires ADCX/ADOX:
> 
> 10:	adcxq	0(%rdi,%rcx),%rax
> 	adoxq	8(%rdi,%rcx),%rdx
>  	leaq	16(%rcx),%rcx
> 	jrcxz	11f
>  	j	10b
> 11:

Getting 2 adc/cycle probably does require a little unrolling.
With luck the adcxq, adoxq and leaq will execute together.
The jrcxz is two clocks - so definitely needs a second adcoxq/adcxq pair.

Experiments would be needed to confirm guesses though.

	David

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ