Message-ID: <4313F9BB-DE2E-448F-A366-A68CAEA2BFE0@zytor.com>
Date: Sat, 06 Jan 2024 17:09:09 -0800
From: "H. Peter Anvin" <hpa@...or.com>
To: David Laight <David.Laight@...LAB.COM>,
"'Linus Torvalds'" <torvalds@...ux-foundation.org>
CC: Noah Goldstein <goldstein.w.n@...il.com>,
"x86@...nel.org" <x86@...nel.org>,
"oe-kbuild-all@...ts.linux.dev" <oe-kbuild-all@...ts.linux.dev>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"edumazet@...gle.com" <edumazet@...gle.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>, "bp@...en8.de" <bp@...en8.de>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>
Subject: RE: x86/csum: Remove unnecessary odd handling
On January 6, 2024 2:08:48 PM PST, David Laight <David.Laight@...LAB.COM> wrote:
>From: Linus Torvalds
>> Sent: 05 January 2024 18:06
>>
>> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@...lab.com> wrote:
>> >
>> > Interesting, I'm pretty sure trying to get two blocks of
>> > 'adc' scheduled in parallel like that doesn't work.
>>
>> You should check out the benchmark at
>>
>> https://github.com/fenrus75/csum_partial
>>
>> and see if you can improve on it. I'm including the patch (on top of
>> that code by Arjan) to implement the actual current kernel version as
>> "New version".
>
>Annoyingly (for me) you are partially right...
>
>I found where my ip checksum perf code was hiding and revisited it.
>Although I found comments elsewhere that the 'jecxz, adc, adc, lea, jmp'
>did an adc every clock it isn't happening for me now.
>
>I'm only measuring the inner loop for multiples of 64 bytes.
>The code for fewer than 8 bytes and for partial final words is a
>separate problem.
>The less unrolled the main loop, the less overhead there'll
>be for 'normal' sizes.
>So I've changed your '80 byte' block to 64 bytes for consistency.
>
>I'm ignoring pre-sandy bridge cpu (no split flags) and pre-broadwell
>(adc takes two clocks - although adc to alternate regs is one clock
>on sandy bridge).
>My test system is an i7-7700, I think anything from broadwell (gen 5)
>will be at least as good.
>I don't have a modern amd cpu.
>
>The best loop for 256+ bytes is an adcx/adox one.
>However, that requires run-time patching.
>Next best is the new kernel version (two blocks of 4 adc).
>The surprising one is:
> xor sum, sum
> 1: adc (buff), sum
> adc 8(buff), sum
> lea 16(buff), buff
> dec count
> jnz 1b
> adc $0, sum
>For 256 bytes it is only a couple of clocks slower.
>Maybe 10% slower for 512+ bytes.
>But it needs almost no extra code for 'normal' buffer sizes.
>By comparison the adcx/adox one is 20% faster.
>
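For readers following along, here is a rough C rendering of the two-adc
loop quoted above. It is only a sketch: adc64() and csum_2adc() are
hypothetical names, the carry flag is modelled by an explicit variable,
and 'count' is assumed to be the number of 16-byte blocks. The assembly
version relies on 'lea' and 'dec' leaving CF untouched, so the carry
survives from one iteration to the next; the C version just carries it
in a variable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: x + y + *carry, carry-out written back (like adc). */
static inline uint64_t adc64(uint64_t x, uint64_t y, unsigned *carry)
{
    uint64_t t = x + y;
    unsigned c = t < x;          /* carry out of x + y */
    uint64_t r = t + *carry;
    c |= r < t;                  /* carry out of adding the carry-in */
    *carry = c;
    return r;
}

/* Sum 'count' 16-byte blocks, two 64-bit adds per iteration. */
static uint64_t csum_2adc(const unsigned char *buff, size_t count)
{
    uint64_t sum = 0;            /* xor sum, sum (also clears the carry) */
    unsigned carry = 0;

    while (count--) {
        uint64_t a, b;
        memcpy(&a, buff, 8);     /* unaligned-safe loads */
        memcpy(&b, buff + 8, 8);
        sum = adc64(sum, a, &carry);   /* adc (buff), sum  */
        sum = adc64(sum, b, &carry);   /* adc 8(buff), sum */
        buff += 16;                    /* lea 16(buff), buff */
    }
    return sum + carry;          /* adc $0, sum */
}
```

The result is the unfolded 64-bit ones' complement sum; folding it down
to 16 bits is left out, as in the quoted loop.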
>The code is doing:
> old = rdpmc
> mfence
> csum = do_csum(buf, len);
> mfence
> clocks = rdpmc - old
>(That is directly reading the pmc register.)
>With a 'no-op' function it takes 160 clocks (I-cache resident).
>Without the mfence 40 - but pretty much everything can execute
>after the 2nd rdpmc.
>
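The measurement sequence quoted above could be sketched in C roughly as
follows. This is not David's test program. Everything named here
(time_csum(), counter_fn, csum_fn, fake_counter(), noop_csum()) is
hypothetical, the counter read is passed in as a function pointer
instead of being a direct rdpmc, and atomic_thread_fence() is used as a
portable stand-in for mfence.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*counter_fn)(void);
typedef uint64_t (*csum_fn)(const unsigned char *buf, size_t len);

/* Stand-in counter for illustration only: advances 100 "clocks" per read.
 * A real harness would read a performance counter here. */
static uint64_t fake_counter(void)
{
    static uint64_t now;
    now += 100;
    return now;
}

/* 'No-op' function, useful for measuring the overhead of the harness. */
static uint64_t noop_csum(const unsigned char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return 0;
}

/* Time one call of do_csum(), fenced on both sides; returns elapsed counts. */
static uint64_t time_csum(counter_fn read_counter, csum_fn do_csum,
                          const unsigned char *buf, size_t len,
                          uint64_t *csum_out)
{
    uint64_t old = read_counter();
    atomic_thread_fence(memory_order_seq_cst);   /* stand-in for mfence */
    uint64_t csum = do_csum(buf, len);
    atomic_thread_fence(memory_order_seq_cst);   /* stand-in for mfence */
    uint64_t clocks = read_counter() - old;

    *csum_out = csum;
    return clocks;
}
```

Timing noop_csum() first gives the fixed overhead, which can then be
subtracted from the figure for a real checksum function.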
>I've attached my (horrid) test program.
>
> David
>
>-
>Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
>Registration No: 1397386 (Wales)
Rather than runtime patching, perhaps separate paths...
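
A minimal sketch of what separate paths might look like, in portable C
rather than assembly. All names are hypothetical, and 'have_adx' is an
assumed flag that the caller would set once from a CPU feature check.
The two-chain path only imitates the adcx/adox idea (two independent
carry chains) using explicit carry variables; it does not emit those
instructions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: x + y + *carry, carry-out written back (like adc). */
static inline uint64_t adc64(uint64_t x, uint64_t y, unsigned *carry)
{
    uint64_t t = x + y;
    unsigned c = t < x;
    uint64_t r = t + *carry;
    c |= r < t;
    *carry = c;
    return r;
}

/* Path 1: a single carry chain, as with plain adc. */
static uint64_t csum_one_chain(const uint64_t *w, size_t n)
{
    uint64_t sum = 0;
    unsigned carry = 0;

    for (size_t i = 0; i < n; i++)
        sum = adc64(sum, w[i], &carry);
    return sum + carry;                 /* final adc $0 */
}

/* Path 2: two independent carry chains, imitating adcx/adox. */
static uint64_t csum_two_chains(const uint64_t *w, size_t n)
{
    uint64_t sum_a = 0, sum_b = 0;
    unsigned carry_a = 0, carry_b = 0, carry = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        sum_a = adc64(sum_a, w[i], &carry_a);      /* chain A */
        sum_b = adc64(sum_b, w[i + 1], &carry_b);  /* chain B */
    }
    if (i < n)                                     /* odd word count */
        sum_a = adc64(sum_a, w[i], &carry_a);

    /* Merge both chains and their pending carries, wrapping the carry. */
    uint64_t sum = adc64(sum_a, sum_b, &carry);
    sum = adc64(sum, (uint64_t)carry_a + carry_b, &carry);
    return sum + carry;
}

/* Ordinary branch on a flag: no code is patched at run time. */
static uint64_t csum_words(const uint64_t *w, size_t n, bool have_adx)
{
    return have_adx ? csum_two_chains(w, n) : csum_one_chain(w, n);
}
```

Both paths compute the same ones' complement sum, so the choice between
them is purely a performance decision.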