Message-ID: <124b21857fe44e499e29800cbf4f63f8@AcuMS.aculab.com>
Date: Sat, 6 Jan 2024 22:08:48 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Linus Torvalds' <torvalds@...ux-foundation.org>
CC: Noah Goldstein <goldstein.w.n@...il.com>, "x86@...nel.org"
<x86@...nel.org>, "oe-kbuild-all@...ts.linux.dev"
<oe-kbuild-all@...ts.linux.dev>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "edumazet@...gle.com" <edumazet@...gle.com>,
"tglx@...utronix.de" <tglx@...utronix.de>, "mingo@...hat.com"
<mingo@...hat.com>, "bp@...en8.de" <bp@...en8.de>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "hpa@...or.com"
<hpa@...or.com>
Subject: RE: x86/csum: Remove unnecessary odd handling
From: Linus Torvalds
> Sent: 05 January 2024 18:06
>
> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@...lab.com> wrote:
> >
> > Interesting, I'm pretty sure trying to get two blocks of
> > 'adc' scheduled in parallel like that doesn't work.
>
> You should check out the benchmark at
>
> https://github.com/fenrus75/csum_partial
>
> and see if you can improve on it. I'm including the patch (on top of
> that code by Arjan) to implement the actual current kernel version as
> "New version".
Annoyingly (for me) you are partially right...
I found where my ip checksum perf code was hiding and revisited it.
Although I found comments elsewhere saying that the 'jecxz, adc, adc, lea, jmp'
loop did an adc every clock, it isn't happening for me now.
I'm only measuring the inner loop for multiples of 64 bytes.
The code for buffers of less than 8 bytes and for partial final words
is a separate problem.
The less unrolled the main loop, the less overhead there'll
be for 'normal' sizes.
So I've changed your '80 byte' block to 64 bytes for consistency.
I'm ignoring pre-sandy bridge cpus (no split flags) and pre-broadwell
ones (adc takes two clocks - although adc to alternate regs is one clock
on sandy bridge).
My test system is an i7-7700; I think anything from broadwell (gen 5)
onwards will be at least as good.
I don't have a modern amd cpu.
The best loop for 256+ bytes is an adcx/adox one.
However that requires run-time patching.
Next best is the new kernel version (two blocks of 4 adc).
The surprising one is:
	xor	sum, sum
1:	adc	(buff), sum
	adc	8(buff), sum
	lea	16(buff), buff
	dec	count
	jnz	1b
	adc	$0, sum
For 256 bytes it is only a couple of clocks slower than the new kernel version.
It is maybe 10% slower for 512+ bytes.
But it needs almost no extra code for 'normal' buffer sizes.
By comparison the adcx/adox one is 20% faster.
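
For readers who want to check results against the short adc loop, here is a
minimal portable C sketch of what that loop computes. It is an illustration
only: the helper names (add64_carry, csum_16byte_loop, csum_fold64) are
hypothetical and are not taken from ipcsum.c or the kernel, and it assumes the
length is a multiple of 16 bytes. It says nothing about the timings above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * 64-bit add with end-around carry. Folding the carry back in at once
 * gives the same value as carrying it into the next 'adc' and finishing
 * with 'adc $0', because the wrapped sum can never be all-ones.
 */
static uint64_t add64_carry(uint64_t sum, uint64_t val)
{
    sum += val;
    return sum + (sum < val);   /* (sum < val) is the carry out */
}

/*
 * Hypothetical helper mirroring the 16-bytes-per-iteration loop.
 * Assumption: len is a multiple of 16; head/tail handling is omitted.
 */
static uint64_t csum_16byte_loop(const unsigned char *buff, size_t len)
{
    uint64_t sum = 0;           /* xor sum, sum */
    size_t count = len / 16;

    while (count--) {           /* dec count; jnz 1b */
        uint64_t w0, w1;

        memcpy(&w0, buff, 8);       /* unaligned-safe 64-bit loads */
        memcpy(&w1, buff + 8, 8);
        sum = add64_carry(sum, w0); /* adc (buff), sum */
        sum = add64_carry(sum, w1); /* adc 8(buff), sum */
        buff += 16;                 /* lea 16(buff), buff */
    }
    return sum;
}

/* Fold the 64-bit ones' complement sum down to 16 bits. */
static uint16_t csum_fold64(uint64_t sum)
{
    sum = (sum & 0xffffffff) + (sum >> 32);
    sum = (sum & 0xffffffff) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

The asm version works because 'dec' and 'lea' leave the carry flag alone, so
the carry from the second adc survives into the next iteration.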
The code is doing:
	old = rdpmc
	mfence
	csum = do_csum(buf, len);
	mfence
	clocks = rdpmc - old
(That is directly reading the pmc register.)
With a 'no-op' function it takes 160 clocks (I-cache resident).
Without the mfence it takes 40 - but then pretty much everything can
execute after the 2nd rdpmc.
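
The shape of that measurement can be sketched in portable C as below. This is
a hedged sketch and not the attached program: rdpmc usually needs kernel
support before user space can execute it, so the counter is passed in as a
callback, with C11 timespec_get() as a stand-in that reports nanoseconds
rather than clocks. atomic_thread_fence(memory_order_seq_cst) is likewise only
a portable stand-in for mfence. All names here (timed_csum, noop_csum,
read_wallclock_ns) are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t (*counter_fn)(void);
typedef uint64_t (*csum_fn)(const unsigned char *buf, size_t len);

/* Stand-in counter: wall-clock nanoseconds, NOT a cycle count. */
static uint64_t read_wallclock_ns(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/* 'No-op' function used to measure the fixed overhead of the harness. */
static uint64_t noop_csum(const unsigned char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return 0;
}

/*
 * Same order of operations as the pseudocode: read the counter, fence,
 * call the function, fence, read the counter again.
 * Returns the elapsed counter units; the checksum goes to *csum_out.
 */
static uint64_t timed_csum(csum_fn do_csum, counter_fn read_counter,
                           const unsigned char *buf, size_t len,
                           uint64_t *csum_out)
{
    uint64_t old = read_counter();

    atomic_thread_fence(memory_order_seq_cst);  /* stand-in for mfence */
    *csum_out = do_csum(buf, len);
    atomic_thread_fence(memory_order_seq_cst);  /* stand-in for mfence */
    return read_counter() - old;
}
```

Timing noop_csum first gives the baseline to subtract from the figure for a
real checksum function, in the same way as the 160 clock overhead quoted above.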
I've attached my (horrid) test program.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
View attachment "ipcsum.c" of type "text/plain" (12950 bytes)