linux-kernel - RE: x86/csum: Remove unnecessary odd handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4313F9BB-DE2E-448F-A366-A68CAEA2BFE0@zytor.com>
Date: Sat, 06 Jan 2024 17:09:09 -0800
From: "H. Peter Anvin" <hpa@...or.com>
To: David Laight <David.Laight@...LAB.COM>,
        "'Linus Torvalds'" <torvalds@...ux-foundation.org>
CC: Noah Goldstein <goldstein.w.n@...il.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "oe-kbuild-all@...ts.linux.dev" <oe-kbuild-all@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "edumazet@...gle.com" <edumazet@...gle.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>, "bp@...en8.de" <bp@...en8.de>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>
Subject: RE: x86/csum: Remove unnecessary odd handling

On January 6, 2024 2:08:48 PM PST, David Laight <David.Laight@...LAB.COM> wrote:
>From: Linus Torvalds
>> Sent: 05 January 2024 18:06
>> 
>> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@...lab.com> wrote:
>> >
>> > Interesting, I'm pretty sure trying to get two blocks of
>> >  'adc' scheduled in parallel like that doesn't work.
>> 
>> You should check out the benchmark at
>> 
>>        https://github.com/fenrus75/csum_partial
>> 
>> and see if you can improve on it. I'm including the patch (on top of
>> that code by Arjan) to implement the actual current kernel version as
>> "New version".
>
>Annoyingly (for me) you are partially right...
>
>I found where my ip checksum perf code was hiding and revisited it.
>Although I found comments elsewhere that the 'jecxz, adc, adc, lea, jmp'
>did an adc every clock it isn't happening for me now.
>
>I'm only measuring the inner loop for multiples of 64 bytes.
>The code less than 8 bytes and partial final words is a
>separate problem.
>The less unrolled the main loop, the less overhead there'll
>be for 'normal' sizes.
>So I've changed your '80 byte' block to 64 bytes for consistency.
>
>I'm ignoring pre-sandy bridge cpu (no split flags) and pre-broadwell
>(adc takes two clocks - although adc to alternate regs is one clock
>on sandy bridge).
>My test system is an i7-7700, I think anything from broadwell (gen 4)
>will be at least as good.
>I don't have a modern amd cpu.
>
>The best loop for 256+ bytes is an adxc/adxo one.
>However that requires the run-time patching.
>Followed by new kernel version (two blocks of 4 adc).
>The surprising one is:
>		xor	sum, sum
>	1:	adc	(buff), sum
>		adc	8(buff), sum
>		lea	16(buff), buff
>		dec	count
>		jnz	1b
>		adc	$0, sum
>For 256 bytes it is only a couple of clocks slower.
>Maybe 10% slower for 512+ bytes.
>But it need almost no extra code for 'normal' buffer sizes.
>By comparison the adxc/adxo one is 20% faster.
>
>The code is doing:
>	old = rdpmc
>	mfence
>	csum = do_csum(buf, len);
>	mfence
>	clocks = rdpmc - old
>(That is directly reading the pmc register.)
>With 'no-op' function it takes 160 clocks (I-cache resident).
>Without the mfence 40 - but pretty much everything can execute
>after the 2nd rdpmc.
>
>I've attached my (horrid) test program.
>
>	David
>
>-
>Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
>Registration No: 1397386 (Wales)

Rather than runtime patching perhaps separate paths...