netdev - RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86


Message-ID: <063D6719AE5E284EB5DD2968C1650D6D1CCD6923@AcuExch.aculab.com>
Date:	Thu, 4 Feb 2016 17:09:53 +0000
From:	David Laight <David.Laight@...LAB.COM>
To:	'Tom Herbert' <tom@...bertland.com>,
	Alexander Duyck <alexander.duyck@...il.com>
CC:	"davem@...emloft.net" <davem@...emloft.net>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"mingo@...hat.com" <mingo@...hat.com>,
	"hpa@...or.com" <hpa@...or.com>, "x86@...nel.org" <x86@...nel.org>,
	"kernel-team@...com" <kernel-team@...com>
Subject: RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

From: Tom Herbert
...
> > If nothing else reducing the size of this main loop may be desirable.
> > I know the newer x86 is supposed to have a loop buffer so that it can
> > basically loop on already decoded instructions.  Normally it is only
> > something like 64 or 128 bytes in size though.  You might find that
> > reducing this loop to that smaller size may improve the performance
> > for larger payloads.
>
> I saw 128 to be better in my testing. For large packets this loop does
> all the work. I see performance dependent on the amount of loop
> overhead, i.e. we got it down to two non-adcq instructions but it is
> still noticeable. Also, this helps a lot on sizes up to 128 bytes
> since we only need to do single call in the jump table and no trip
> through the loop.

But one of your 'loop overhead' instructions is 'loop' itself.
See the instruction tables at http://www.agner.org/optimize/instruction_tables.pdf:
'loop' is slow on Intel CPUs, so you don't want to be using it there.

You might get some benefit from pipelining the loop (so you do
a read to register in one iteration and a register-register adc
the next).
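
To make the pipelining idea concrete, here is a rough C sketch, not the
patch's code. The names csum_simple, csum_pipelined, add_carry and fold64
are made up for illustration, and the input is assumed to be whole 64-bit
words. The compiler is free to schedule this differently, so the real
thing would need to be done in assembler; the sketch only shows the
structure (prologue load, add of the previously loaded word, epilogue add).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit add with end-around carry (what adc does). */
static uint64_t add_carry(uint64_t sum, uint64_t v)
{
	sum += v;
	return sum + (sum < v);	/* wrap the carry back into bit 0 */
}

/* Hypothetical helper: fold a 64-bit ones' complement sum to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Straightforward loop: each iteration loads a word and adds it. */
uint16_t csum_simple(const uint64_t *p, size_t nwords)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum = add_carry(sum, p[i]);
	return fold64(sum);
}

/*
 * Pipelined loop: the load for word i is issued in the same iteration
 * that adds word i - 1, so the add only ever uses a value that is
 * already in a register.
 */
uint16_t csum_pipelined(const uint64_t *p, size_t nwords)
{
	uint64_t sum = 0;
	uint64_t cur;

	if (nwords == 0)
		return 0;

	cur = p[0];				/* prologue: first load */
	for (size_t i = 1; i < nwords; i++) {
		uint64_t next = p[i];		/* read to register */
		sum = add_carry(sum, cur);	/* register-register add */
		cur = next;
	}
	sum = add_carry(sum, cur);		/* epilogue: last word */
	return fold64(sum);
}
```

Both functions return the same value for any input. Whether the pipelined
form is actually faster on an out-of-order CPU is something that would
have to be measured.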

	David