Message-ID: <e77165c267df486f914f8013fede1d32@AcuMS.aculab.com>
Date:   Tue, 22 Nov 2022 13:08:23 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>
CC:     Arnd Bergmann <arnd@...db.de>,
        Thomas Gleixner <tglx@...utronix.de>,
        "Ingo Molnar" <mingo@...hat.com>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>
Subject: Optimising csum_fold()

There are currently 20 copies of csum_fold(), some in C and some in assembler.
The default C version (in asm-generic/checksum.h) is pretty horrid.
Some of the asm versions (including x86 and x86-64) aren't much better.

There are 3 pretty good C versions:
  1: (~sum - rol32(sum, 16)) >> 16
  2: ~(sum + rol32(sum, 16)) >> 16
  3: (u16)~((sum + rol32(sum, 16)) >> 16)
All three are (usually) 4 arithmetic instructions.
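
As a minimal sketch (written for userspace, not copied from any kernel tree),
the three variants can be put next to a two-step fold in the style of the
generic C version and compared. The uint32_t/uint16_t types, the local rol32()
and the fold_* names are assumptions made so the snippet builds on its own;
the kernel versions work on __wsum and return __sum16.

```c
#include <stdint.h>

/* Local stand-in for the kernel's rol32(): rotate a 32-bit word left. */
static inline uint32_t rol32(uint32_t x, unsigned int n)
{
	return (x << (n & 31)) | (x >> (-n & 31));
}

/* Reference: add the two halves, then add the carry back in, then invert. */
static inline uint16_t fold_ref(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Version 1: the invert and the rotate do not depend on each other. */
static inline uint32_t fold_v1(uint32_t sum)
{
	return (~sum - rol32(sum, 16)) >> 16;
}

/* Version 2: same value as version 1, but the invert follows the add. */
static inline uint32_t fold_v2(uint32_t sum)
{
	return ~(sum + rol32(sum, 16)) >> 16;
}

/* Version 3: shift first, invert last; the cast discards the high bits. */
static inline uint16_t fold_v3(uint32_t sum)
{
	return (uint16_t)~((sum + rol32(sum, 16)) >> 16);
}
```

The add of sum and its rotated copy leaves the folded 16-bit sum, including
the end-around carry, in the high half of the word, which is why a single
shift by 16 finishes the job. All four functions should agree for every
32-bit input.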

The first two have the advantage that the high 16 bits of the result are zero.
This is relevant when the value is being checked rather than set.

The first one can generate better instruction scheduling (the rotate
and invert can be executed in the same clock).

The 3rd one saves an instruction on arm, but may need masking.
(I've not compiled an arm kernel to see how often that happens.)
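
To illustrate the masking point, here is a hedged sketch of version 3 with the
(u16) cast left off, next to version 1. The names fold3_unmasked(), fold1(),
csum_ok_v1() and csum_ok_v3() are hypothetical and only exist for this
example; the types are userspace stand-ins for the kernel ones.

```c
#include <stdint.h>

/* Local stand-in for the kernel's rol32() with a fixed rotate of 16. */
static inline uint32_t rol32_16(uint32_t x)
{
	return (x << 16) | (x >> 16);
}

/* Version 3 without the cast: the final invert sets the high 16 bits. */
static inline uint32_t fold3_unmasked(uint32_t sum)
{
	return ~((sum + rol32_16(sum)) >> 16);
}

/* Version 1: the final shift leaves the high 16 bits zero. */
static inline uint32_t fold1(uint32_t sum)
{
	return (~sum - rol32_16(sum)) >> 16;
}

/* "Checked rather than set": a valid checksum folds to zero. */
static inline int csum_ok_v1(uint32_t sum)
{
	return fold1(sum) == 0;			/* no mask needed */
}

static inline int csum_ok_v3(uint32_t sum)
{
	return (uint16_t)fold3_unmasked(sum) == 0;	/* mask needed first */
}
```

When the result is only stored into a 16-bit header field the truncation comes
for free with the store, so the mask only costs anything on the compare path.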

The only architectures where (I think) the current asm code is better
than the C above are sparc and sparc64.
Sparc doesn't have a rotate instruction, but does have a carry flag.
This makes the current asm version one instruction shorter.

For architectures like MIPS and RISC-V, which have neither rotate
instructions nor carry flags, the C is as good as the current asm.
The rotate is 3 instructions - the same as the extra cmp+add.
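
A sketch of the two shapes being compared, assuming a machine with neither a
rotate instruction nor a carry flag. The function names are hypothetical and
the compare-and-add form is only one plausible way of recovering the carry,
not a copy of any particular architecture's code.

```c
#include <stdint.h>

/* Rotate by 16 built from plain shifts: shift, shift, or. */
static inline uint32_t rot16_shifts(uint32_t x)
{
	return (x << 16) | (x >> 16);
}

/* Version 1 from above, using the shift-based rotate. */
static inline uint16_t fold_rotate(uint32_t sum)
{
	return (uint16_t)((~sum - rot16_shifts(sum)) >> 16);
}

/*
 * No carry flag: add the low half into the high half, detect the 32-bit
 * wrap with an unsigned compare, then add that carry back in.
 */
static inline uint16_t fold_cmp_add(uint32_t sum)
{
	uint32_t orig = sum;
	uint32_t carry;

	sum += sum << 16;	/* high half is now hi + lo, may have wrapped */
	carry = sum < orig;	/* 1 if the add overflowed, else 0 */
	sum >>= 16;
	sum += carry;		/* end-around carry */
	return (uint16_t)~sum;
}
```

Both functions should return the same value for any input; the difference is
only in which operations the compiler has to emit.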

Changing everything to use version 1 would improve quite a few architectures
while only adding 1 clock to some paths in arm/arm64 and sparc.

Unfortunately it is all currently a mess.
Most architectures don't include asm-generic/checksum.h at all.

Thoughts?

	David

