Message-ID: <5354eeec562345f6a1de84f0b2081b75@AcuMS.aculab.com>
Date: Fri, 5 Jan 2024 10:41:24 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Linus Torvalds' <torvalds@...ux-foundation.org>, Noah Goldstein
<goldstein.w.n@...il.com>
CC: kernel test robot <lkp@...el.com>, "x86@...nel.org" <x86@...nel.org>,
"oe-kbuild-all@...ts.linux.dev" <oe-kbuild-all@...ts.linux.dev>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"edumazet@...gle.com" <edumazet@...gle.com>, "tglx@...utronix.de"
<tglx@...utronix.de>, "mingo@...hat.com" <mingo@...hat.com>, "bp@...en8.de"
<bp@...en8.de>, "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"hpa@...or.com" <hpa@...or.com>
Subject: RE: x86/csum: Remove unnecessary odd handling
From: Linus Torvalds
> Sent: 05 January 2024 00:33
>
> On Thu, 4 Jan 2024 at 15:36, Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > Anyway, since I looked at the thing originally, and feel like I know
> > the x86 side and understand the strange IP csum too, I just applied it
> > directly.
>
> I ended up just applying my 40-byte cleanup thing too that I've been
> keeping in my own tree since posting it (as the "Silly csum
> improvement. Maybe" patch).
Interesting, I'm pretty sure trying to get two blocks of
'adc' scheduled in parallel like that doesn't work.
I got an adc every clock from this 'beast':
+ /*
+ * Align the byte count to a multiple of 16 then
+ * add 64 bit words to alternating registers.
+ * Finally reduce to 64 bits.
+ */
+ asm( " bt $4, %[len]\n"
+ " jnc 10f\n"
+ " add (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 16(%[len]), %[len]\n"
+ "10: jecxz 20f\n"
+ " adc (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 32(%[len]), %[len_tmp]\n"
+ " adc 16(%[buff], %[len]), %[sum_0]\n"
+ " adc 24(%[buff], %[len]), %[sum_1]\n"
+ " mov %[len_tmp], %[len]\n"
+ " jmp 10b\n"
+ "20: adc %[sum_0], %[sum]\n"
+ " adc %[sum_1], %[sum]\n"
+ " adc $0, %[sum]\n"
+ : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+ [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+ : [buff] "r" (buff)
+ : "memory" );
Followed by code to sort out any trailing 15 bytes.
Intel CPUs (from P-II until 5th-gen Broadwell) take two clocks for 'adc'
(probably because it needs 3 inputs).
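For readers who want the shape of that loop without reading the asm, here is a rough portable C sketch. It is not the patch: the names add64_carry, csum_sketch and csum_fold64 are made up for illustration, it folds the carry back in after every add instead of chaining it through 'adc', and it deals with the trailing bytes by copying them into a zero-padded 16-byte block, which is only one assumed way of doing it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit add with end-around carry (the effect of 'adc $0'). */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	a += b;
	return a + (a < b);	/* a < b means the add wrapped */
}

/*
 * Hypothetical helper: add 64-bit words to two alternating
 * accumulators, then reduce to a single 64-bit sum.
 */
static uint64_t csum_sketch(const unsigned char *buff, size_t len, uint64_t sum)
{
	uint64_t sum_0 = 0, sum_1 = 0, w0, w1;
	size_t blocks = len & ~(size_t)15;	/* whole 16-byte blocks */
	unsigned char tail[16] = { 0 };
	size_t i;

	for (i = 0; i < blocks; i += 16) {
		/* memcpy keeps misaligned reads well defined in C */
		memcpy(&w0, buff + i, 8);
		memcpy(&w1, buff + i + 8, 8);
		sum_0 = add64_carry(sum_0, w0);
		sum_1 = add64_carry(sum_1, w1);
	}

	/* Assumed tail handling: up to 15 bytes, zero padded. */
	if (len > blocks) {
		memcpy(tail, buff + blocks, len - blocks);
		memcpy(&w0, tail, 8);
		memcpy(&w1, tail + 8, 8);
		sum_0 = add64_carry(sum_0, w0);
		sum_1 = add64_carry(sum_1, w1);
	}

	sum = add64_carry(sum, sum_0);
	return add64_carry(sum, sum_1);
}

/* Fold the 64-bit sum down to 16 bits (not inverted). */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The two accumulators are independent here, so a compiler is free to schedule them in parallel. Whether it generates anything close to the adc loop above is a separate question and would need measuring.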
So 'adc' chains ran a lot slower than you might think.
(Clearly no one ever actually benchmarked the old code!)
The first fix made the carry output available early - so adding
to alternate registers helps. IIRC this is in Ivy/Sandy Bridge.
Maybe no one cares about Ivy/Sandy Bridge and Haswell any more.
AMD CPUs don't have this problem.
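A toy model makes the alternating-register point concrete. This is only a sketch under assumed numbers: adc_chain_clocks is a made-up helper, the latencies passed in are illustrative inputs rather than measurements, and it ignores execution ports, loads and the loop overhead. Each adc is taken to start once both its carry input and its register input are ready.

```c
/*
 * Toy dependency-chain model (assumed latencies, not measurements).
 * n_adc:         number of adc instructions in the chain
 * n_regs:        number of accumulator registers used in rotation (1..8)
 * reg_latency:   clocks until the register result is available
 * carry_latency: clocks until the carry flag result is available
 * Returns the clock at which the last register result is ready.
 */
static unsigned adc_chain_clocks(unsigned n_adc, unsigned n_regs,
				 unsigned reg_latency, unsigned carry_latency)
{
	unsigned reg_ready[8] = { 0 };
	unsigned carry_ready = 0, start = 0, i;

	if (n_adc == 0 || n_regs == 0 || n_regs > 8)
		return 0;

	for (i = 0; i < n_adc; i++) {
		unsigned r = i % n_regs;

		/* Wait for whichever input arrives last. */
		start = reg_ready[r] > carry_ready ? reg_ready[r] : carry_ready;
		reg_ready[r] = start + reg_latency;
		carry_ready = start + carry_latency;
	}
	return start + reg_latency;
}
```

With a 2-clock register result and a 2-clock carry, 100 adc take 200 clocks whether one or two registers are used. With the carry available after 1 clock, one register still takes 200 clocks, but alternating two registers takes 101, i.e. about one adc per clock.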
I'm pretty sure I measured that loop with a misaligned buffer.
Measurably slower, but less than one clock per cache line.
I guess that the cache-line crossing reads get split, but you
gain most back because the cpu can do two reads/clock.
Maybe I'll sort out another patch...
I did get 15/16 bytes/clock with a similar loop that used adox/adcx
but that needed unrolling again and only works on a few CPUs.
IIRC AMD have some CPUs that support adox - but execute it slowly!
Annoyingly you can't use 'loop' even on CPUs that support adox
because it is stupidly slow on Intel CPUs (OK on AMD).
That version is a lot of pain since it needs run-time patching.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)