linux-kernel - RE: x86/csum: Remove unnecessary odd handling

Message-ID: <5354eeec562345f6a1de84f0b2081b75@AcuMS.aculab.com>
Date: Fri, 5 Jan 2024 10:41:24 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Linus Torvalds' <torvalds@...ux-foundation.org>, Noah Goldstein
	<goldstein.w.n@...il.com>
CC: kernel test robot <lkp@...el.com>, "x86@...nel.org" <x86@...nel.org>,
	"oe-kbuild-all@...ts.linux.dev" <oe-kbuild-all@...ts.linux.dev>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"edumazet@...gle.com" <edumazet@...gle.com>, "tglx@...utronix.de"
	<tglx@...utronix.de>, "mingo@...hat.com" <mingo@...hat.com>, "bp@...en8.de"
	<bp@...en8.de>, "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
	"hpa@...or.com" <hpa@...or.com>
Subject: RE: x86/csum: Remove unnecessary odd handling

From: Linus Torvalds
> Sent: 05 January 2024 00:33
> 
> On Thu, 4 Jan 2024 at 15:36, Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > Anyway, since I looked at the thing originally, and feel like I know
> > the x86 side and understand the strange IP csum too, I just applied it
> > directly.
> 
> I ended up just applying my 40-byte cleanup thing too that I've been
> keeping in my own tree since posting it (as the "Silly csum
> improvement. Maybe" patch).

Interesting. I'm pretty sure that trying to get two blocks of
'adc' scheduled in parallel like that doesn't work.

I got an adc every clock from this 'beast':
+       /*
+        * Align the byte count to a multiple of 16 then
+        * add 64 bit words to alternating registers.
+        * Finally reduce to 64 bits.
+        */
+       asm(    "       bt    $4, %[len]\n"
+               "       jnc   10f\n"
+               "       add   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "10:    jecxz 20f\n"
+               "       adc   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   32(%[len]), %[len_tmp]\n"
+               "       adc   16(%[buff], %[len]), %[sum_0]\n"
+               "       adc   24(%[buff], %[len]), %[sum_1]\n"
+               "       mov   %[len_tmp], %[len]\n"
+               "       jmp   10b\n"
+               "20:    adc   %[sum_0], %[sum]\n"
+               "       adc   %[sum_1], %[sum]\n"
+               "       adc   $0, %[sum]\n"
+           : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+               [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+           : [buff] "r" (buff)
+           : "memory" );

This is followed by code to sort out any trailing bytes (up to 15).
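
As a rough portable illustration of what the loop above computes (not the
posted patch itself), the sketch below threads one emulated carry flag
through two alternating 64-bit accumulators and then reduces them to a
single 64-bit value. The names adc64, csum_words_alternating and
csum_fold64 are hypothetical, the index runs forwards rather than from a
negative length up to zero, and the length is assumed to be a multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emulate x86 'adc': *acc += val + *carry, leaving the carry-out in *carry. */
static void adc64(uint64_t *acc, uint64_t val, unsigned *carry)
{
    uint64_t t = *acc + val;
    unsigned out = t < val;          /* carry from acc + val */

    *acc = t + *carry;
    *carry = out | (*acc < t);       /* or carry from adding the old flag */
}

/*
 * Hypothetical portable equivalent of the asm loop.
 * Assumption: len is a multiple of 16.
 */
static uint64_t csum_words_alternating(const unsigned char *buff, size_t len,
                                       uint64_t sum)
{
    uint64_t sum_0 = 0, sum_1 = 0, w;
    unsigned carry = 0;
    size_t i;

    for (i = 0; i < len; i += 16) {
        memcpy(&w, buff + i, 8);
        adc64(&sum_0, w, &carry);    /* even words go to sum_0 */
        memcpy(&w, buff + i + 8, 8);
        adc64(&sum_1, w, &carry);    /* odd words go to sum_1 */
    }

    /* Reduce to 64 bits. */
    adc64(&sum, sum_0, &carry);
    adc64(&sum, sum_1, &carry);
    /*
     * Fold the last carry back in. This is done twice because the first
     * fold can itself carry out when sum is all ones.
     */
    adc64(&sum, 0, &carry);
    adc64(&sum, 0, &carry);
    return sum;
}

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t csum_fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}
```

The C compiler is unlikely to turn this into a clean 'adc' chain, which is
the reason for the inline asm; the sketch is only meant to show the
arithmetic.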

Intel CPUs (from P-II until Broadwell, 5th-gen) take two clocks for 'adc'
(probably because it needs three inputs),
so 'adc' chains run a lot slower than you might think.
(Clearly no one ever actually benchmarked the old code!)
The first fix made the carry output available early, so adding
to alternate registers helps. IIRC this is in Ivy/Sandy Bridge.
Maybe no one cares about Ivy/Sandy Bridge and Haswell any more.
AMD CPUs don't have this problem.

I'm pretty sure I measured that loop with a misaligned buffer.
It was measurably slower, but by less than one clock per cache line.
I guess that the cache-line crossing reads get split, but you
gain most of that back because the CPU can do two reads/clock.
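
To put a number on how many reads can be affected, here is a small sketch
that counts the 8-byte reads straddling a cache-line boundary for a given
start address. It assumes 64-byte cache lines; crosses_line and
count_split_reads are hypothetical helper names, and the code models only
the address arithmetic, not how any particular CPU handles a split read.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u   /* assumed cache-line size in bytes */

/* Does a read of 'size' bytes at 'addr' straddle a cache-line boundary? */
static int crosses_line(uintptr_t addr, size_t size)
{
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}

/* Count the 8-byte reads that straddle a line when reading 'len' bytes. */
static size_t count_split_reads(uintptr_t start, size_t len)
{
    size_t off, n = 0;

    for (off = 0; off + 8 <= len; off += 8)
        n += crosses_line(start + off, 8);
    return n;
}
```

With these assumptions an 8-byte-aligned buffer never splits a read, and any
other alignment splits exactly one read in eight, that is one per cache line,
which fits a penalty bounded per cache line.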

Maybe I'll sort out another patch...

I did get about 15-16 bytes/clock with a similar loop that used adox/adcx,
but that needed unrolling again and only works on a few CPUs.
IIRC AMD has some CPUs that support adox but execute it slowly!
Annoyingly you can't use 'loop' even on CPUs that support adox
because it is stupidly slow on Intel CPUs (OK on AMD).

That version is a lot of pain since it needs run-time patching.
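
The reason adcx/adox can help is that adcx reads and writes only the carry
flag and adox only the overflow flag, so two add-with-carry chains can be
interleaved without disturbing each other. The sketch below emulates that
in portable C with two separate flag variables. It is not the loop
described above; add_with_flag, add64_eac, csum_two_chains and
csum_fold64 are hypothetical names, and the length is assumed to be a
multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* *acc += val + *flag, leaving the carry-out in *flag. */
static void add_with_flag(uint64_t *acc, uint64_t val, unsigned *flag)
{
    uint64_t t = *acc + val;
    unsigned out = t < val;

    *acc = t + *flag;
    *flag = out | (*acc < t);
}

/* 64-bit add with end-around carry (ones' complement addition). */
static uint64_t add64_eac(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;

    return s + (s < a);
}

/* Assumption: len is a multiple of 16. */
static uint64_t csum_two_chains(const unsigned char *buff, size_t len)
{
    uint64_t sum_c = 0, sum_o = 0, w;
    unsigned cf = 0, of = 0;     /* stand-ins for the CF and OF flags */
    size_t i;

    for (i = 0; i < len; i += 16) {
        memcpy(&w, buff + i, 8);
        add_with_flag(&sum_c, w, &cf);   /* like adcx: uses only CF */
        memcpy(&w, buff + i + 8, 8);
        add_with_flag(&sum_o, w, &of);   /* like adox: uses only OF */
    }

    /* Fold each pending flag into its own chain, then merge the chains. */
    return add64_eac(add64_eac(sum_c, cf), add64_eac(sum_o, of));
}

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t csum_fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}
```

Because neither chain depends on the other's flag, the two sequences of
additions have no data dependency between them until the final merge.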

	David
