lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e6bb59c32134477aa4890047ae5ad51b@AcuMS.aculab.com>
Date:   Thu, 9 Feb 2023 09:35:29 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Bibo Mao' <maobibo@...ngson.cn>,
        Huacai Chen <chenhuacai@...nel.org>,
        WANG Xuerui <kernel@...0n.name>
CC:     Jiaxun Yang <jiaxun.yang@...goat.com>,
        "loongarch@...ts.linux.dev" <loongarch@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: Bibo Mao
> Sent: 09 February 2023 03:59
> 
> loongArch platform is 64-bit system, which supports 8 bytes memory
> accessing, generic checksum function uses 4 byte memory access.
> This patch adds 8-bytes memory access optimization for checksum
> function on loongArch. And the code comes from arm64 system.

How fast do these functions actually run (in bytes/clock)?

It is quite possible that just adding 32bit values to a
64bit register is faster.
Any non-trivial cpu will run that at 4 bytes/clock
(for suitably unrolled and pipelined code).
On a more complex cpu adding to two registers will
give 8 bytes/clock (needs two memory loads/clock).

The fastest 64bit sum you'll get on anything mips-like
(no carry flag) is probably from something like:
	val = *mem++;  // 64bit read
	sum += val;
	carry = sum < val;
	carry_sum += carry;
which is 2 bytes/instruction again.
To get to 8 bytes/clock you need to execute all 4 instructions
every clock - so 1 read and 3 arithmetic.
(c/f 2 read and 2 arithmetic for 32bit adds.)

Arm has a carry flag so the code is:
	val = *mem++;
	temp,carry = sum + val;
	sum = sum + val + carry;
There are still two dependant arithmetic instructions for
each 8-byte word.
The dependencies on the flags register also make it harder
to get any benefit from interleaving adds to two registers.

x86-64 uses 64bit 'add with carry' chains.
No one ever noticed that they take two clocks each on
Intel cpu until (about) Haswell.
It is possible to get 12 bytes/clock with some strange
loops that use (IIRC) adxo and adxc.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ