linux-kernel - RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <2aa6243491784e74960182dc12968170@AcuMS.aculab.com>
Date:   Thu, 9 Feb 2023 12:39:07 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'maobibo' <maobibo@...ngson.cn>,
        Huacai Chen <chenhuacai@...nel.org>,
        "WANG Xuerui" <kernel@...0n.name>
CC:     Jiaxun Yang <jiaxun.yang@...goat.com>,
        "loongarch@...ts.linux.dev" <loongarch@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: maobibo
> Sent: 09 February 2023 11:55
> 
> 
> 在 2023/2/9 17:35, David Laight 写道:
> > From: Bibo Mao
> >> Sent: 09 February 2023 03:59
> >>
> >> loongArch platform is 64-bit system, which supports 8 bytes memory
> >> accessing, generic checksum function uses 4 byte memory access.
> >> This patch adds 8-bytes memory access optimization for checksum
> >> function on loongArch. And the code comes from arm64 system.
> >
> > How fast do these functions actually run (in bytes/clock)?
> With uint128 method, there will unrolled loop, instruction
> can execute in parallel. It gets the best result on loongarch
> system where there is no neither carry flag nor post-index
> addressing modes.

We're probably almost agreeing...

> Here is the piece of disassemble code with uint128 method:

Load 8 values:

>    120000a40:   28c0222f        ld.d    $r15,$r17,8(0x8)
>    120000a44:   28c0622a        ld.d    $r10,$r17,24(0x18)
>    120000a48:   28c0a230        ld.d    $r16,$r17,40(0x28)
>    120000a4c:   28c0e232        ld.d    $r18,$r17,56(0x38)
>    120000a50:   28c0022e        ld.d    $r14,$r17,0
>    120000a54:   28c0422d        ld.d    $r13,$r17,16(0x10)
>    120000a58:   28c0822b        ld.d    $r11,$r17,32(0x20)
>    120000a5c:   28c0c22c        ld.d    $r12,$r17,48(0x30)

Pairwise add them

>    120000a60:   0010b9f7        add.d   $r23,$r15,$r14
>    120000a64:   0010b54d        add.d   $r13,$r10,$r13
>    120000a68:   0010b24c        add.d   $r12,$r18,$r12
>    120000a6c:   0010ae0b        add.d   $r11,$r16,$r11

Generate 4 'carry' bits

>    120000a70:   0012c992        sltu    $r18,$r12,$r18
>    120000a74:   0012beee        sltu    $r14,$r23,$r15
>    120000a78:   0012c170        sltu    $r16,$r11,$r16
>    120000a7c:   0012a9aa        sltu    $r10,$r13,$r10

Add the carry bits onto the sums.
I've not quite worked out which add is which!
But I think you've missed a few adds here.

>    120000a80:   0010ae0f        add.d   $r15,$r16,$r11
>    120000a84:   0010ddce        add.d   $r14,$r14,$r23
>    120000a88:   0010b250        add.d   $r16,$r18,$r12
>    120000a8c:   0010b54d        add.d   $r13,$r10,$r13
>    120000a90:   0010b5d2        add.d   $r18,$r14,$r13
>    120000a94:   0010c1f0        add.d   $r16,$r15,$r16

Somewhere each value needs an add, an sltu to generate the 'carry',
and an add for the carry itself.
If you sum the carry bits into a separate register it is
possible to get a both adds and the sltu (for different values)
to run in the same clock (on a suitable cpu).
If there are 4 integer units you can also get the loop instructions
'for free' and unrolling 8 times may not be needed at all.

...
> There is no post-index addressing modes on loongarch,
>  	val = *mem;  // 64bit read
>         mem++;
>  	sum += val;
>  	carry = sum < val;
>  	carry_sum += carry;
> it takes 5 instruction and these 5 instructions depends on previous instr.

I'd assume the loop was unrolled enough so the address
increment doesn't matter.

> There is the piece of disassemble code:
>    120000d90:   28c001f0        ld.d    $r16,$r15,0
>    120000d94:   0010c58c        add.d   $r12,$r12,$r17
>    120000d98:   02c021ef        addi.d  $r15,$r15,8(0x8)

Those three instructions are independent.

>    120000d9c:   0010b20c        add.d   $r12,$r16,$r12

that one depends on the ld.d

>    120000da0:   0012c191        sltu    $r17,$r12,$r16

that depends on the add.d
but it could be execute after the 'bne' in parallel with the ld.d

>    120000da4:   5fffedf2        bne     $r15,$r18,-20(0x3ffec) # 120000d90 <do_csum_64+0x90>

If you tweak the code it is possible to get down to just
the addi.d and bne constraining the dependency chain.
(Assuming there is no delay on the read and there are an infinite
number of execution units.)
Unroll once and do:
	ld.d r,addr,0
	addi.d addr,16
	ld.d r,addr,-8
	bne addr,limit,loop_top
and you might get a loop that does a memory read every clock.

So you end up worrying about how the memory read delays affect
the instruction pipeline.
The Intel x86 cpu I've got just pile up the arithmetic instructions
waiting for the data to be read.
If you get a memory read requested every clock everything else
follows - provided you don't try to execute too many instrcutions
at once.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)