Message-ID: <8ff2151323cd440da25fd49061d9cc44@AcuMS.aculab.com>
Date: Fri, 5 Jan 2024 16:12:05 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Linus Torvalds' <torvalds@...ux-foundation.org>, 'Noah Goldstein'
<goldstein.w.n@...il.com>
CC: 'kernel test robot' <lkp@...el.com>, "'x86@...nel.org'" <x86@...nel.org>,
"'oe-kbuild-all@...ts.linux.dev'" <oe-kbuild-all@...ts.linux.dev>,
"'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>,
"'edumazet@...gle.com'" <edumazet@...gle.com>, "'tglx@...utronix.de'"
<tglx@...utronix.de>, "'mingo@...hat.com'" <mingo@...hat.com>,
"'bp@...en8.de'" <bp@...en8.de>, "'dave.hansen@...ux.intel.com'"
<dave.hansen@...ux.intel.com>, "'hpa@...or.com'" <hpa@...or.com>
Subject: RE: x86/csum: Remove unnecessary odd handling
From: David Laight
> Sent: 05 January 2024 10:41
>
> From: Linus Torvalds
> > Sent: 05 January 2024 00:33
> >
> > On Thu, 4 Jan 2024 at 15:36, Linus Torvalds
> > <torvalds@...ux-foundation.org> wrote:
> > >
> > > Anyway, since I looked at the thing originally, and feel like I know
> > > the x86 side and understand the strange IP csum too, I just applied it
> > > directly.
> >
> > I ended up just applying my 40-byte cleanup thing too that I've been
> > keeping in my own tree since posting it (as the "Silly csum
> > improvement. Maybe" patch).
>
> Interesting, I'm pretty sure trying to get two blocks of
> 'adc' scheduled in parallel like that doesn't work.
>
> I got an adc every clock from this 'beast':
> + /*
> + * Align the byte count to a multiple of 16 then
> + * add 64 bit words to alternating registers.
> + * Finally reduce to 64 bits.
> + */
> + asm( " bt $4, %[len]\n"
> + " jnc 10f\n"
> + " add (%[buff], %[len]), %[sum_0]\n"
> + " adc 8(%[buff], %[len]), %[sum_1]\n"
> + " lea 16(%[len]), %[len]\n"
> + "10: jecxz 20f\n"
> + " adc (%[buff], %[len]), %[sum_0]\n"
> + " adc 8(%[buff], %[len]), %[sum_1]\n"
> + " lea 32(%[len]), %[len_tmp]\n"
> + " adc 16(%[buff], %[len]), %[sum_0]\n"
> + " adc 24(%[buff], %[len]), %[sum_1]\n"
> + " mov %[len_tmp], %[len]\n"
> + " jmp 10b\n"
> + "20: adc %[sum_0], %[sum]\n"
> + " adc %[sum_1], %[sum]\n"
> + " adc $0, %[sum]\n"
> + : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
> + [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
> + : [buff] "r" (buff)
> + : "memory" );
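For anyone reading along without the asm in their head, the arithmetic in the quoted block can be sketched in portable C. This is only an illustration: add64_carry() and csum_two_acc() are made-up names, the length is assumed to already be a multiple of 16, and the C does not chain the carry flag between the two registers the way the adc sequence does, although it gives an equivalent ones' complement sum.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 64-bit add with end-around carry. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	/* A carry out of bit 63 is wrapped back into bit 0. */
	return s + (s < b);
}

/*
 * Hypothetical sketch: add 64-bit words to alternating accumulators,
 * then reduce both into 'sum'. Assumes len is a multiple of 16.
 */
static uint64_t csum_two_acc(const unsigned char *buff, size_t len,
			     uint64_t sum)
{
	uint64_t sum_0 = 0, sum_1 = 0, w0, w1;
	size_t i;

	for (i = 0; i < len; i += 16) {
		/* memcpy avoids alignment and aliasing problems. */
		memcpy(&w0, buff + i, 8);
		memcpy(&w1, buff + i + 8, 8);
		sum_0 = add64_carry(sum_0, w0);
		sum_1 = add64_carry(sum_1, w1);
	}
	sum = add64_carry(sum, sum_0);
	return add64_carry(sum, sum_1);
}
```

The point of the two accumulators is to give the CPU two independent dependency chains; whether that actually buys anything is exactly what is being questioned here.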
I've got far too many x86 checksum functions lying around.
Actually you don't need all that.
Anything recent (probably Broadwell on) will execute:
	"10:	jecxz 20f\n"
	"	adc   (%[buff], %[len]), %[sum]\n"
	"	adc   8(%[buff], %[len]), %[sum]\n"
	"	lea   16(%[len]), %[len]\n"
	"	jmp   10b\n"
	"20:	adc   $0, %[sum]\n"
in two clocks per iteration - 8 bytes/clock.
Since it is trivial to handle 8n+4 buffers (eg as above)
that only leaves the C code to handle the final 0-7 bytes.
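A rough sketch of what that C tail handling might look like, assuming the asm has already summed all the whole 64-bit words. csum_tail() and csum_fold64() are hypothetical names, not the kernel's helpers, and the final inversion of the checksum is left out.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 64-bit add with end-around carry. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < b);
}

/*
 * Hypothetical sketch: add the final 0-7 bytes. The bytes are copied
 * into a zeroed 64-bit word, which is the same as zero-padding the
 * buffer to the next multiple of 8.
 */
static uint64_t csum_tail(const unsigned char *buff, size_t len,
			  uint64_t sum)
{
	uint64_t w = 0;

	memcpy(&w, buff, len & 7);
	return add64_carry(sum, w);
}

/* Fold a 64-bit ones' complement sum down to 16 bits (not inverted). */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Each fold step is done twice because the first addition can itself carry out of the narrower width.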
> Maybe I'll sort out another patch...
Probably after the next rc1 is out.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)