linux-kernel - RE: x86/csum: Remove unnecessary odd handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <124b21857fe44e499e29800cbf4f63f8@AcuMS.aculab.com>
Date: Sat, 6 Jan 2024 22:08:48 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Linus Torvalds' <torvalds@...ux-foundation.org>
CC: Noah Goldstein <goldstein.w.n@...il.com>, "x86@...nel.org"
	<x86@...nel.org>, "oe-kbuild-all@...ts.linux.dev"
	<oe-kbuild-all@...ts.linux.dev>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "edumazet@...gle.com" <edumazet@...gle.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>, "mingo@...hat.com"
	<mingo@...hat.com>, "bp@...en8.de" <bp@...en8.de>,
	"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "hpa@...or.com"
	<hpa@...or.com>
Subject: RE: x86/csum: Remove unnecessary odd handling

From: Linus Torvalds
> Sent: 05 January 2024 18:06
> 
> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@...lab.com> wrote:
> >
> > Interesting, I'm pretty sure trying to get two blocks of
> >  'adc' scheduled in parallel like that doesn't work.
> 
> You should check out the benchmark at
> 
>        https://github.com/fenrus75/csum_partial
> 
> and see if you can improve on it. I'm including the patch (on top of
> that code by Arjan) to implement the actual current kernel version as
> "New version".

Annoyingly (for me) you are partially right...

I found where my ip checksum perf code was hiding and revisited it.
Although I found comments elsewhere that the 'jecxz, adc, adc, lea, jmp'
did an adc every clock it isn't happening for me now.

I'm only measuring the inner loop for multiples of 64 bytes.
The code less than 8 bytes and partial final words is a
separate problem.
The less unrolled the main loop, the less overhead there'll
be for 'normal' sizes.
So I've changed your '80 byte' block to 64 bytes for consistency.

I'm ignoring pre-sandy bridge cpu (no split flags) and pre-broadwell
(adc takes two clocks - although adc to alternate regs is one clock
on sandy bridge).
My test system is an i7-7700, I think anything from broadwell (gen 4)
will be at least as good.
I don't have a modern amd cpu.

The best loop for 256+ bytes is an adxc/adxo one.
However that requires the run-time patching.
Followed by new kernel version (two blocks of 4 adc).
The surprising one is:
		xor	sum, sum
	1:	adc	(buff), sum
		adc	8(buff), sum
		lea	16(buff), buff
		dec	count
		jnz	1b
		adc	$0, sum
For 256 bytes it is only a couple of clocks slower.
Maybe 10% slower for 512+ bytes.
But it need almost no extra code for 'normal' buffer sizes.
By comparison the adxc/adxo one is 20% faster.

The code is doing:
	old = rdpmc
	mfence
	csum = do_csum(buf, len);
	mfence
	clocks = rdpmc - old
(That is directly reading the pmc register.)
With 'no-op' function it takes 160 clocks (I-cache resident).
Without the mfence 40 - but pretty much everything can execute
after the 2nd rdpmc.

I've attached my (horrid) test program.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

View attachment "ipcsum.c" of type "text/plain" (12950 bytes)