Message-ID: <8ff2151323cd440da25fd49061d9cc44@AcuMS.aculab.com>
Date: Fri, 5 Jan 2024 16:12:05 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Linus Torvalds' <torvalds@...ux-foundation.org>, 'Noah Goldstein'
	<goldstein.w.n@...il.com>
CC: 'kernel test robot' <lkp@...el.com>, "'x86@...nel.org'" <x86@...nel.org>,
	"'oe-kbuild-all@...ts.linux.dev'" <oe-kbuild-all@...ts.linux.dev>,
	"'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>,
	"'edumazet@...gle.com'" <edumazet@...gle.com>, "'tglx@...utronix.de'"
	<tglx@...utronix.de>, "'mingo@...hat.com'" <mingo@...hat.com>,
	"'bp@...en8.de'" <bp@...en8.de>, "'dave.hansen@...ux.intel.com'"
	<dave.hansen@...ux.intel.com>, "'hpa@...or.com'" <hpa@...or.com>
Subject: RE: x86/csum: Remove unnecessary odd handling

From: David Laight
> Sent: 05 January 2024 10:41
> 
> From: Linus Torvalds
> > Sent: 05 January 2024 00:33
> >
> > On Thu, 4 Jan 2024 at 15:36, Linus Torvalds
> > <torvalds@...ux-foundation.org> wrote:
> > >
> > > Anyway, since I looked at the thing originally, and feel like I know
> > > the x86 side and understand the strange IP csum too, I just applied it
> > > directly.
> >
> > I ended up just applying my 40-byte cleanup thing too that I've been
> > keeping in my own tree since posting it (as the "Silly csum
> > improvement. Maybe" patch).
> 
> Interesting, I'm pretty sure trying to get two blocks of
>  'adc' scheduled in parallel like that doesn't work.
> 
> I got an adc every clock from this 'beast':
> +       /*
> +        * Align the byte count to a multiple of 16 then
> +        * add 64 bit words to alternating registers.
> +        * Finally reduce to 64 bits.
> +        */
> +       asm(    "       bt    $4, %[len]\n"
> +               "       jnc   10f\n"
> +               "       add   (%[buff], %[len]), %[sum_0]\n"
> +               "       adc   8(%[buff], %[len]), %[sum_1]\n"
> +               "       lea   16(%[len]), %[len]\n"
> +               "10:    jecxz 20f\n"
> +               "       adc   (%[buff], %[len]), %[sum_0]\n"
> +               "       adc   8(%[buff], %[len]), %[sum_1]\n"
> +               "       lea   32(%[len]), %[len_tmp]\n"
> +               "       adc   16(%[buff], %[len]), %[sum_0]\n"
> +               "       adc   24(%[buff], %[len]), %[sum_1]\n"
> +               "       mov   %[len_tmp], %[len]\n"
> +               "       jmp   10b\n"
> +               "20:    adc   %[sum_0], %[sum]\n"
> +               "       adc   %[sum_1], %[sum]\n"
> +               "       adc   $0, %[sum]\n"
> +           : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
> +               [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
> +           : [buff] "r" (buff)
> +           : "memory" );

I've got far too many x86 checksum functions lying around.

Actually you don't need all that.
Anything recent (probably Broadwell on) will execute:
	"10:    jecxz 20f\n"
	"       adc   (%[buff], %[len]), %[sum]\n"
	"       adc   8(%[buff], %[len]), %[sum]\n"
	"       lea   16(%[len]), %[len]\n"
	"       jmp   10b\n"
	"20:    adc   $0, %[sum]\n"
in two clocks per iteration, i.e. 8 bytes/clock.
Since it is trivial to handle 8n+4 buffers (e.g. as above),
that only leaves the C code to handle the final 0-7 bytes.
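
For anyone following along without the asm in their head, here is a rough portable C sketch of what that loop plus the C tail handling amounts to. It is not the kernel code: `csum64_sketch` and `add64_carry` are made-up names, the split of the tail into an optional 8-byte word plus 0-7 zero-padded bytes is an assumption for illustration, and it folds the carry after every add rather than keeping it in the carry flag until the final `adc $0`. The result should be the same as a ones' complement sum, though not necessarily bit-identical to the asm in the 0 versus all-ones corner case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add b into a with end-around carry: roughly "adc" followed by "adc $0". */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	a += b;
	return a + (a < b);	/* a < b only if the add wrapped */
}

/*
 * Hypothetical helper, not kernel code: sum a buffer as native-endian
 * 64-bit words into a 64-bit accumulator.
 */
static uint64_t csum64_sketch(const void *data, size_t len, uint64_t sum)
{
	size_t bulk = len & ~(size_t)15;	/* bytes for the 16-byte loop */
	size_t rest = len - bulk;		/* what is left for the tail */
	const unsigned char *end = (const unsigned char *)data + bulk;
	ptrdiff_t off = -(ptrdiff_t)bulk;	/* negative index, counts up to 0 */
	uint64_t w[2];

	while (off != 0) {			/* jecxz 20f */
		memcpy(w, end + off, 16);
		sum = add64_carry(sum, w[0]);	/* adc (buff, len), sum */
		sum = add64_carry(sum, w[1]);	/* adc 8(buff, len), sum */
		off += 16;			/* lea 16(len), len */
	}

	assert(rest < 16);
	if (rest & 8) {				/* one more whole word */
		uint64_t x;

		memcpy(&x, end, 8);
		sum = add64_carry(sum, x);
		end += 8;
	}
	if (rest & 7) {				/* final 1-7 bytes, zero padded */
		uint64_t x = 0;

		memcpy(&x, end, rest & 7);
		sum = add64_carry(sum, x);
	}
	return sum;
}
```

The pointer to the end of the bulk region plus a negative offset mirrors the asm: the loop exits when the index reaches zero, so no separate compare is needed, and in the asm version neither `lea` nor `jecxz` touches the carry flag.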

> Maybe I'll sort out another patch...

Probably after the next rc1 is out.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)