linux-kernel - RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

Message-ID: <5dc1c39bbaf149f78c132f4467cdb365@AcuMS.aculab.com>
Date:   Tue, 14 Feb 2023 09:47:29 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'maobibo' <maobibo@...ngson.cn>,
        Huacai Chen <chenhuacai@...nel.org>
CC:     WANG Xuerui <kernel@...0n.name>,
        Jiaxun Yang <jiaxun.yang@...goat.com>,
        "loongarch@...ts.linux.dev" <loongarch@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: maobibo
> Sent: 14 February 2023 01:31
...
> Part of the asm code in
> https://github.com/loongson/linux/commit/92a6df48ccb73dd2c3dc1799add08adf0e0b0deb
> depends on the previous instruction, such as the ADDC macro:
> #define ADDC(sum,reg)                                           \
>         ADD     sum, sum, reg;                                  \
>         sltu    t8, sum, reg;                                   \
>         ADD     sum, sum, t8;                                   \
> These three instructions depend on each other and cannot execute
> in parallel.

Right, but you can add the carry bits into a different register.
Since the aim is 8 bytes/clock limited by 1 memory read/clock
you can (probably) manage with all the word adds going to one
register and all the carry adds to a second. So:
#define ADDC(carry, sum, reg) \
	add	sum, sum, reg;	\
	sltu	reg, sum, reg;	\
	add	carry, carry, reg
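
In C terms the split looks roughly like the sketch below. The function name is made up for illustration and is not from the patch; it assumes whole 64-bit words and leaves out alignment, the tail bytes and the final fold to 16 bits. The point is only that the add into 'sum' no longer waits for the carry of the previous word.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): ones' complement sum of
 * 64-bit words with the carries collected in a separate accumulator.
 */
uint64_t csum_words_split_carry(const uint64_t *words, size_t count)
{
	uint64_t sum = 0;
	uint64_t carry = 0;

	for (size_t i = 0; i < count; i++) {
		uint64_t v = words[i];

		sum += v;		/* add  sum, sum, reg */
		carry += (sum < v);	/* sltu reg, sum, reg; add carry, carry, reg */
	}

	/* Finalise: fold the carry count back in with an end-around carry. */
	sum += carry;
	sum += (sum < carry);
	return sum;
}
```

Only the 'sum' chain and the 'carry' chain are serial here, and each is a single add per word, so neither sits behind the sltu of the previous word.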

> 
> The original of main loop about Lmove_128bytes is:
> #define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3)     \
>         LOAD    _t0, src, (offset + UNIT(0));                   \
>         LOAD    _t1, src, (offset + UNIT(1));                   \
>         LOAD    _t2, src, (offset + UNIT(2));                   \
>         LOAD    _t3, src, (offset + UNIT(3));                   \
>         ADDC(_t0, _t1);                                         \
>         ADDC(_t2, _t3);                                         \
>         ADDC(sum, _t0);                                         \
>         ADDC(sum, _t2)
> 
> .Lmove_128bytes:
>         CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
>         CSUM_BIGCHUNK(src, 0x20, sum, t0, t1, t3, t4)
>         CSUM_BIGCHUNK(src, 0x40, sum, t0, t1, t3, t4)
>         CSUM_BIGCHUNK(src, 0x60, sum, t0, t1, t3, t4)
>         addi.d  t5, t5, -1
>         addi.d  src, src, 0x80
>         bnez    t5, .Lmove_128bytes
> 
> I modified the main loop at label .Lmove_128bytes to reduce
> dependencies between instructions, like this, which improves
> performance:
> .Lmove_128bytes:
>         LOAD    t0, src, 0
>         LOAD    t1, src, 8
>         LOAD    t3, src, 16
>         LOAD    t4, src, 24
>         LOAD    a3, src, 0 + 0x20
>         LOAD    a4, src, 8 + 0x20
>         LOAD    a5, src, 16 + 0x20
>         LOAD    a6, src, 24 + 0x20
>         ADD     t0, t0,  t1
>         ADD     t3, t3,  t4
>         ADD     a3, a3,  a4
>         ADD     a5, a5,  a6
>         sltu    t8, t0,  t1
>         sltu    a7, t3,  t4
>         ADD     t0, t0,  t8
>         ADD     t3, t3,  a7
>         sltu    t1, a3,  a4
>         sltu    t4, a5,  a6
>         ADD     a3, a3,  t1
>         ADD     a5, a5,  t4
>         ADD     t0, t0, t3
>         ADD     a3, a3, a5
>         sltu    t1, t0, t3
>         sltu    t4, a3, a5
>         ADD     t0, t0, t1
>         ADD     a3, a3, t4
>         ADD     sum, sum, t0
>         sltu    t8,  sum, t0
>         ADD     sum, sum,  t8
>         ADD     sum, sum, a3
>         sltu    t8,  sum, a3
>         addi.d  t5, t5, -1
>         ADD     sum, sum, t8
> 
> However the result and principle are almost the same as the
> uint128 C code. And there is no performance impact from
> interleaving the reads and ALU operations.

You are still relying on the 'out of order' logic to execute
ALU instructions while the memory reads are going on.
Try something like:
	complex setup :-)
loop:
	sltu	c0, sum, v0
	load	v0, src, 0
	add	sum, v1
	add	carry, c3

	sltu	c1, sum, v1
	load	v1, src, 8
	add	sum, v2
	add	carry, c0

	sltu	c2, sum, v2
	load	v2, src, 16
	addi	src, 32
	add	sum, v3
	add	carry, c1

	sltu	c3, sum, v3
	load	v3, src, 24
	add	sum, v0
	add	carry, c2
	bne	src, limit, loop

	complex finalise

The idea is that each group of instructions executes
in one clock, so the loop takes 4 clocks for 32 bytes.
The above code allows for 2 delay clocks on reads.
They may not be needed; in that case the above may run
at 8 bytes/clock with just 2 blocks of instructions.

You'd give the cpu a bit more leeway by using two sum and
carry registers.
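
A rough C sketch of the two sum/carry pair variant follows. The function name is hypothetical, and it assumes the word count is small enough that the carry counters cannot themselves overflow; whether a compiler schedules it the way the hand-written loop would is a separate question.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): two independent
 * sum/carry pairs, merged only after the loop.
 */
uint64_t csum_words_two_chains(const uint64_t *words, size_t count)
{
	uint64_t sum0 = 0, sum1 = 0;
	uint64_t carry0 = 0, carry1 = 0;
	size_t i = 0;

	/* Even words go to pair 0, odd words to pair 1. */
	for (; i + 1 < count; i += 2) {
		uint64_t v0 = words[i];
		uint64_t v1 = words[i + 1];

		sum0 += v0;
		carry0 += (sum0 < v0);
		sum1 += v1;
		carry1 += (sum1 < v1);
	}

	/* Odd word count: the last word goes to pair 0. */
	if (i < count) {
		uint64_t v = words[i];

		sum0 += v;
		carry0 += (sum0 < v);
	}

	/* Finalise: merge the two pairs, then fold the carries back in. */
	sum0 += sum1;
	carry0 += (sum0 < sum1);
	carry0 += carry1;
	sum0 += carry0;
	sum0 += (sum0 < carry0);
	return sum0;
}
```

The four accumulators have no dependencies on each other inside the loop, which is where the extra leeway comes from.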

I'd time the loop without worrying about the setup/finalise
code.

	David
