Message-ID: <8bc06b0ef9c243b087e944e0611b54b5@AcuMS.aculab.com>
Date:   Thu, 16 Feb 2023 09:03:08 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'maobibo' <maobibo@...ngson.cn>,
        Huacai Chen <chenhuacai@...nel.org>
CC:     WANG Xuerui <kernel@...0n.name>,
        Jiaxun Yang <jiaxun.yang@...goat.com>,
        "loongarch@...ts.linux.dev" <loongarch@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: maobibo
> Sent: 14 February 2023 14:19
...
> Got it. It makes use of pipeline better, rather than number of ALUs for
> different micro-architectures. I will try this method, thanks again for
> kindly help and explanation with patience.

It is also worth pointing out that if the cpu does 'out of order'
execution it may be just as good to simply repeat blocks of:
	load	v0, addr, 0*8
	add	sum0, v0
	sltu	v0, sum0, v0
	add	carry0, v0

That assumes the prefetch/decode logic can predict the loop
and generate enough decoded instructions for all the alu units.

The add/sltu/add will be queued until the load completes
and then execute in the next three clocks.
The load for the next block will be scheduled as soon as
the load/store unit has finished processing the previous load.
So all the alu instructions just wait for the required input
to be available and a memory load executes every clock.
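
For reference, the repeated block corresponds roughly to the C
below. This is only a sketch: the function name sum64_with_carry
and the final end-around fold are my own assumptions for
illustration and are not taken from the patch. Each loop iteration
is one load/add/sltu/add block, with the carries counted in a
separate register instead of being added back straight away.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: ones' complement sum of n 64-bit words. */
static uint64_t sum64_with_carry(const uint64_t *p, size_t n)
{
	uint64_t sum0 = 0, carry0 = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t v0 = p[i];	/* load  v0, addr, i*8 */
		sum0 += v0;		/* add   sum0, v0      */
		v0 = sum0 < v0;		/* sltu  v0, sum0, v0  */
		carry0 += v0;		/* add   carry0, v0    */
	}

	/* Fold the carry count back in, with end-around carry. */
	sum0 += carry0;
	if (sum0 < carry0)
		sum0++;
	return sum0;
}
```

The only dependency carried from one block to the next is through
sum0 and carry0, which is what lets an out-of-order cpu overlap the
loads of later blocks with the alu work of earlier ones.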

Multiple sum0 and carry0 registers aren't actually needed.
But having 2 of each (even if the loop is unrolled 4 times)
might help a bit.
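
A sketch of what two sum/carry pairs with a 4x unroll might look
like in C. The name sum64_two_chains and the way the two chains are
merged at the end are assumptions made for this example, and a
compiler is free to schedule it differently from how it is written.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: unrolled 4 times, alternating two chains. */
static uint64_t sum64_two_chains(const uint64_t *p, size_t n)
{
	uint64_t sum0 = 0, sum1 = 0, carry0 = 0, carry1 = 0;
	uint64_t v;
	size_t i = 0;

	for (; i + 4 <= n; i += 4) {
		v = p[i];     sum0 += v; carry0 += sum0 < v;
		v = p[i + 1]; sum1 += v; carry1 += sum1 < v;
		v = p[i + 2]; sum0 += v; carry0 += sum0 < v;
		v = p[i + 3]; sum1 += v; carry1 += sum1 < v;
	}
	for (; i < n; i++) {		/* leftover words */
		v = p[i];
		sum0 += v;
		carry0 += sum0 < v;
	}

	/* Merge the two chains, then fold the carries back in. */
	carry0 += carry1;
	sum0 += sum1;
	carry0 += sum0 < sum1;
	sum0 += carry0;
	if (sum0 < carry0)
		sum0++;
	return sum0;
}
```

With two pairs, consecutive blocks no longer depend on each other
through the same sum register, which shortens the chain the adds
have to wait on.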

If the cpu does 'register renaming' (as most x86 do) you
can use the same register name for 'v0' in all the blocks
(even though it is live with multiple values at once).

But a simpler in-order multi-issue cpu will need you to
correctly interleave the instructions for maximum throughput.
Interleaving also does no harm on a very simple cpu that has
delays before a loaded value can be used.
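
To illustrate the interleaved ordering, here is a C sketch with two
blocks woven together, so that no statement uses the result of the
one immediately before it. The name sum64_interleaved is made up for
this example, and since a C compiler will reschedule the statements
anyway, the order only matters literally when writing the assembler
by hand.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: two blocks interleaved statement by statement. */
static uint64_t sum64_interleaved(const uint64_t *p, size_t n)
{
	uint64_t sum0 = 0, sum1 = 0, carry0 = 0, carry1 = 0;
	size_t i = 0;

	for (; i + 2 <= n; i += 2) {
		uint64_t v0 = p[i];	/* load v0 */
		uint64_t v1 = p[i + 1];	/* load v1 */
		sum0 += v0;		/* add  sum0, v0 */
		sum1 += v1;		/* add  sum1, v1 */
		v0 = sum0 < v0;		/* sltu v0, sum0, v0 */
		v1 = sum1 < v1;		/* sltu v1, sum1, v1 */
		carry0 += v0;		/* add  carry0, v0 */
		carry1 += v1;		/* add  carry1, v1 */
	}
	if (i < n) {			/* odd word left over */
		uint64_t v0 = p[i];
		sum0 += v0;
		carry0 += sum0 < v0;
	}

	/* Merge the two chains, then fold the carries back in. */
	carry0 += carry1;
	sum0 += sum1;
	carry0 += sum0 < sum1;
	sum0 += carry0;
	if (sum0 < carry0)
		sum0++;
	return sum0;
}
```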

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
