Message-ID: <5dc1c39bbaf149f78c132f4467cdb365@AcuMS.aculab.com>
Date:   Tue, 14 Feb 2023 09:47:29 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'maobibo' <maobibo@...ngson.cn>,
        Huacai Chen <chenhuacai@...nel.org>
CC:     WANG Xuerui <kernel@...0n.name>,
        Jiaxun Yang <jiaxun.yang@...goat.com>,
        "loongarch@...ts.linux.dev" <loongarch@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: maobibo
> Sent: 14 February 2023 01:31
...
> Part of the asm code at
> https://github.com/loongson/linux/commit/92a6df48ccb73dd2c3dc1799add08adf0e0b0deb
> depends on the previous instruction, such as the macro ADDC:
> #define ADDC(sum,reg)                                           \
>         ADD     sum, sum, reg;                                  \
>         sltu    t8, sum, reg;                                   \
>         ADD     sum, sum, t8;                                   \
> These three instructions depend on each other and cannot execute
> in parallel.

Right, but you can add the carry bits into a different register.
Since the aim is 8 bytes/clock limited by 1 memory read/clock
you can (probably) manage with all the word adds going to one
register and all the carry adds to a second. So:
#define ADDC(carry, sum, reg) \
	add	sum, sum, reg;	\
	sltu	reg, sum, reg;	\
	add	carry, carry, reg
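
For anyone who wants to check the arithmetic, here is a minimal C sketch of the same idea. The function names are mine (hypothetical), not from the patch, and it assumes 64-bit words and a final end-around fold of the carry count. csum64_serial() mirrors the original three-instruction ADDC; csum64_split() keeps the carries in a second accumulator and folds them in once at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original pattern: the carry is folded back into 'sum' on every word,
 * so each step depends on the previous one (add, sltu, add). */
uint64_t csum64_serial(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;

    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        sum += p[i];
        sum += (sum < p[i]);    /* end-around carry, same register */
    }
    return sum;
}

/* Suggested pattern: word adds go to 'sum', carry bits to 'carry'.
 * The only loop-carried chains are two independent adds. */
uint64_t csum64_split(const uint64_t *p, size_t n)
{
    uint64_t sum = 0, carry = 0;

    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        sum += p[i];
        carry += (sum < p[i]);  /* sltu result goes to the second register */
    }
    /* Finalise: fold the carry count in with one end-around add. */
    sum += carry;
    sum += (sum < carry);
    return sum;
}
```

Both functions should return the same 64-bit value for any input, since each computes the total modulo 2^64 - 1 and can only return zero when every word is zero.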

> 
> The original of main loop about Lmove_128bytes is:
> #define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3)     \
>         LOAD    _t0, src, (offset + UNIT(0));                   \
>         LOAD    _t1, src, (offset + UNIT(1));                   \
>         LOAD    _t2, src, (offset + UNIT(2));                   \
>         LOAD    _t3, src, (offset + UNIT(3));                   \
>         ADDC(_t0, _t1);                                         \
>         ADDC(_t2, _t3);                                         \
>         ADDC(sum, _t0);                                         \
>         ADDC(sum, _t2)
> 
> .Lmove_128bytes:
>         CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
>         CSUM_BIGCHUNK(src, 0x20, sum, t0, t1, t3, t4)
>         CSUM_BIGCHUNK(src, 0x40, sum, t0, t1, t3, t4)
>         CSUM_BIGCHUNK(src, 0x60, sum, t0, t1, t3, t4)
>         addi.d  t5, t5, -1
>         addi.d  src, src, 0x80
>         bnez    t5, .Lmove_128bytes
> 
> I modified the main loop at label .Lmove_128bytes to reduce
> dependencies between instructions, like this, which improves
> performance:
> .Lmove_128bytes:
>         LOAD    t0, src, 0
>         LOAD    t1, src, 8
>         LOAD    t3, src, 16
>         LOAD    t4, src, 24
>         LOAD    a3, src, 0 + 0x20
>         LOAD    a4, src, 8 + 0x20
>         LOAD    a5, src, 16 + 0x20
>         LOAD    a6, src, 24 + 0x20
>         ADD     t0, t0,  t1
>         ADD     t3, t3,  t4
>         ADD     a3, a3,  a4
>         ADD     a5, a5,  a6
>         sltu    t8, t0,  t1
>         sltu    a7, t3,  t4
>         ADD     t0, t0,  t8
>         ADD     t3, t3,  a7
>         sltu    t1, a3,  a4
>         sltu    t4, a5,  a6
>         ADD     a3, a3,  t1
>         ADD     a5, a5,  t4
>         ADD     t0, t0, t3
>         ADD     a3, a3, a5
>         sltu    t1, t0, t3
>         sltu    t4, a3, a5
>         ADD     t0, t0, t1
>         ADD     a3, a3, t4
>         ADD     sum, sum, t0
>         sltu    t8,  sum, t0
>         ADD     sum, sum,  t8
>         ADD     sum, sum, a3
>         sltu    t8,  sum, a3
>         addi.d  t5, t5, -1
>         ADD     sum, sum, t8
> 
> However, the result and principle are almost the same as the
> uint128 C code, and interleaving the reads and ALU operations
> has no performance impact.

You are still relying on the 'out of order' logic to execute
ALU instructions while the memory reads are going on.
Try something like:
	complex setup :-)
loop:
	sltu	c0, sum, v0
	load	v0, src, 0
	add	sum, v1
	add	carry, c3

	sltu	c1, sum, v1
	load	v1, src, 8
	add	sum, v2
	add	carry, c0

	sltu	c2, sum, v2
	load	v2, src, 16
	addi	src, 32
	add	sum, v3
	add	carry, c1

	sltu	c3, sum, v3
	load	v3, src, -8		# src was already advanced by 32 above
	add	sum, v0
	add	carry, c2
	bne	src, limit, loop

	complex finalise

The idea being that each group of instructions executes
in one clock - so the loop is 4 clocks.
The above code allows for 2 delay clocks on reads.
They may not be needed; in that case the above may run
at 8 bytes/clock with just 2 blocks of instructions.

You'd give the cpu a bit more leeway by using two sum and
carry registers.
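
A rough C sketch of what two sum and carry registers could look like, again with hypothetical names and assuming 64-bit words. Even and odd words feed separate accumulator pairs, so there are four short independent chains instead of two, and the pairs are only combined in the finalise step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: fold the carry into the sum on every word. */
uint64_t csum64_reference(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;

    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        sum += p[i];
        sum += (sum < p[i]);
    }
    return sum;
}

/* Two independent (sum, carry) pairs; combined only after the loop. */
uint64_t csum64_two_acc(const uint64_t *p, size_t n)
{
    uint64_t sum0 = 0, sum1 = 0, carry0 = 0, carry1 = 0;
    size_t i;

    assert(p != NULL || n == 0);
    for (i = 0; i + 1 < n; i += 2) {
        sum0 += p[i];
        carry0 += (sum0 < p[i]);
        sum1 += p[i + 1];
        carry1 += (sum1 < p[i + 1]);
    }
    if (i < n) {                    /* odd trailing word */
        sum0 += p[i];
        carry0 += (sum0 < p[i]);
    }

    /* Finalise: merge the two pairs, then one end-around add. */
    uint64_t carry = carry0 + carry1;   /* at most n, cannot wrap */
    sum0 += sum1;
    carry += (sum0 < sum1);
    sum0 += carry;
    sum0 += (sum0 < carry);
    return sum0;
}
```

Whether a compiler keeps the four chains separate is not guaranteed, so this is only meant to show that the result is unchanged, not to predict what the hand-written loop would achieve.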

I'd time the loop without worrying about the setup/finalise
code.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
