linux-kernel - RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <8bc06b0ef9c243b087e944e0611b54b5@AcuMS.aculab.com>
Date:   Thu, 16 Feb 2023 09:03:08 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'maobibo' <maobibo@...ngson.cn>,
        Huacai Chen <chenhuacai@...nel.org>
CC:     WANG Xuerui <kernel@...0n.name>,
        Jiaxun Yang <jiaxun.yang@...goat.com>,
        "loongarch@...ts.linux.dev" <loongarch@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: maobibo
> Sent: 14 February 2023 14:19
...
> Got it. It makes use of pipeline better, rather than number of ALUs for
> different micro-architectures. I will try this method, thanks again for
> kindly help and explanation with patience.

It is also worth pointing out that if the cpu does 'out of order'
execution it may be just as good to just repeat blocks of:
	load	v0, addr, 0*8
	add	sum0, v0
	sltu	v0, sum0, v0
	add	carry0, v0

Assuming the prefetch/decode logic can predict the loop
and generate enough decoded instruction for all the alu units.

The add/sltu/add will be queued until the load completes
and then execute in the next three clocks.
The load for the next block will be scheduled as soon as
the load/store unit has finished processing the previous load.
So all the alu instructions just wait for the required input
to be available and a memory load executes every clock.

Multiple sum0 and carry0 registers aren't actually needed.
But having 2 of each (even if the loop is unrolled 4 times)
might help a bit.

If the cpu does 'register renaming' (as most x86 do) you
can use the same register name for 'v0' in all the blocks
(even though it is alive with multiple values).

But a simpler in-order multi-issue cpu will need you to
correctly interleave the instructions for maximum throughput.
It also does no hard for a very simple cpu that has delays
before a read value can be used.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)