linux-kernel - Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system


Message-ID: <741b2246-d609-ccc6-bf55-d6b0b5e54b54@loongson.cn>
Date:   Thu, 9 Feb 2023 19:55:27 +0800
From:   maobibo <maobibo@...ngson.cn>
To:     David Laight <David.Laight@...LAB.COM>,
        Huacai Chen <chenhuacai@...nel.org>,
        WANG Xuerui <kernel@...0n.name>
Cc:     Jiaxun Yang <jiaxun.yang@...goat.com>,
        "loongarch@...ts.linux.dev" <loongarch@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system



On 2023/2/9 17:35, David Laight wrote:
> From: Bibo Mao
>> Sent: 09 February 2023 03:59
>>
>> The LoongArch platform is a 64-bit system that supports 8-byte memory
>> accesses, while the generic checksum function uses 4-byte memory accesses.
>> This patch adds an 8-byte memory access optimization for the checksum
>> function on LoongArch. The code comes from the arm64 implementation.
> 
> How fast do these functions actually run (in bytes/clock)?
With the uint128 method the loop is unrolled, so the instructions
can execute in parallel. It gives the best result on LoongArch,
which has neither a carry flag nor post-index addressing modes.

Here is a piece of the disassembled code for the uint128 method:
   120000a40:   28c0222f        ld.d    $r15,$r17,8(0x8)
   120000a44:   28c0622a        ld.d    $r10,$r17,24(0x18)
   120000a48:   28c0a230        ld.d    $r16,$r17,40(0x28)
   120000a4c:   28c0e232        ld.d    $r18,$r17,56(0x38)
   120000a50:   28c0022e        ld.d    $r14,$r17,0
   120000a54:   28c0422d        ld.d    $r13,$r17,16(0x10)
   120000a58:   28c0822b        ld.d    $r11,$r17,32(0x20)
   120000a5c:   28c0c22c        ld.d    $r12,$r17,48(0x30)
   120000a60:   0010b9f7        add.d   $r23,$r15,$r14
   120000a64:   0010b54d        add.d   $r13,$r10,$r13
   120000a68:   0010b24c        add.d   $r12,$r18,$r12
   120000a6c:   0010ae0b        add.d   $r11,$r16,$r11
   120000a70:   0012c992        sltu    $r18,$r12,$r18
   120000a74:   0012beee        sltu    $r14,$r23,$r15
   120000a78:   0012c170        sltu    $r16,$r11,$r16
   120000a7c:   0012a9aa        sltu    $r10,$r13,$r10
   120000a80:   0010ae0f        add.d   $r15,$r16,$r11
   120000a84:   0010ddce        add.d   $r14,$r14,$r23
   120000a88:   0010b250        add.d   $r16,$r18,$r12
   120000a8c:   0010b54d        add.d   $r13,$r10,$r13
   120000a90:   0010b5d2        add.d   $r18,$r14,$r13
   120000a94:   0010c1f0        add.d   $r16,$r15,$r16
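
For readers without the patch at hand, here is a minimal C sketch of the
kind of source that produces a load / add.d / sltu / add.d pattern like the
one above. It is modelled loosely on the arm64 approach and is not the patch
itself: the names accumulate(), csum_words_u128() and fold64() are
hypothetical, and it assumes a compiler that provides __uint128_t (GCC or
Clang on a 64-bit target).

```c
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement add of two 64-bit values: the carry out of bit 63
 * is wrapped around and added back into the low 64 bits. */
static inline uint64_t accumulate(uint64_t sum, uint64_t data)
{
	__uint128_t tmp = (__uint128_t)sum + data;

	return (uint64_t)tmp + (uint64_t)(tmp >> 64);
}

/* Hypothetical helper: sum nwords 64-bit words, 8 words per iteration.
 * The four pair sums in each iteration are independent of each other,
 * so a superscalar core can execute them in parallel. */
uint64_t csum_words_u128(const uint64_t *p, size_t nwords)
{
	uint64_t sum = 0;
	size_t i = 0;

	for (; i + 8 <= nwords; i += 8) {
		uint64_t a = accumulate(p[i], p[i + 1]);
		uint64_t b = accumulate(p[i + 2], p[i + 3]);
		uint64_t c = accumulate(p[i + 4], p[i + 5]);
		uint64_t d = accumulate(p[i + 6], p[i + 7]);

		a = accumulate(a, b);
		c = accumulate(c, d);
		sum = accumulate(sum, accumulate(a, c));
	}
	/* Remaining words, one at a time. */
	for (; i < nwords; i++)
		sum = accumulate(sum, p[i]);
	return sum;
}

/* Fold a 64-bit ones'-complement sum down to 16 bits. */
uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

On a target without a carry flag the compiler has to turn each accumulate()
into an add, a compare (sltu) and a second add, which is consistent with the
listing above.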

> 
> It is quite possible that just adding 32bit values to a
> 64bit register is faster.
> Any non-trivial cpu will run that at 4 bytes/clock
> (for suitably unrolled and pipelined code).
> On a more complex cpu adding to two registers will
> give 8 bytes/clock (needs two memory loads/clock).
> 
> The fastest 64bit sum you'll get on anything mips-like
> (no carry flag) is probably from something like:
> 	val = *mem++;  // 64bit read
> 	sum += val;
> 	carry = sum < val;
> 	carry_sum += carry;
> which is 2 bytes/instruction again.
> To get to 8 bytes/clock you need to execute all 4 instructions
> every clock - so 1 read and 3 arithmetic.
There are no post-index addressing modes on LoongArch, so the loop body is:
	val = *mem;  // 64bit read
	mem++;
	sum += val;
	carry = sum < val;
	carry_sum += carry;
It takes 5 instructions, and each of them depends on the previous one.
Here is a piece of the disassembled code:
   120000d90:   28c001f0        ld.d    $r16,$r15,0
   120000d94:   0010c58c        add.d   $r12,$r12,$r17
   120000d98:   02c021ef        addi.d  $r15,$r15,8(0x8)
   120000d9c:   0010b20c        add.d   $r12,$r16,$r12
   120000da0:   0012c191        sltu    $r17,$r12,$r16
   120000da4:   5fffedf2        bne     $r15,$r18,-20(0x3ffec) # 120000d90 <do_csum_64+0x90>
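
For comparison, a self-contained C version of the simple loop discussed
above might look like the sketch below. The function name
csum_words_simple() is hypothetical and this is only an illustration of the
carry-by-compare technique, not the code that was actually measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum nwords 64-bit words one at a time, detecting
 * the carry with an unsigned compare because there is no carry flag. */
uint64_t csum_words_simple(const uint64_t *mem, size_t nwords)
{
	const uint64_t *end = mem + nwords;
	uint64_t sum = 0, carry_sum = 0;

	while (mem != end) {
		uint64_t val = *mem;	/* 64-bit read */

		mem++;
		sum += val;
		carry_sum += (sum < val);	/* 1 if the add wrapped */
	}

	/* Add the collected carries back in, with end-around carry. */
	sum += carry_sum;
	sum += (sum < carry_sum);
	return sum;
}
```

Every add to sum depends on the previous iteration's sum, so the loop is
limited by that dependency chain no matter how wide the core is.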

regards
bibo, mao
          

> (c/f 2 read and 2 arithmetic for 32bit adds.)
> 
> Arm has a carry flag so the code is:
> 	val = *mem++;
> 	temp,carry = sum + val;
> 	sum = sum + val + carry;
> There are still two dependant arithmetic instructions for
> each 8-byte word.
> The dependencies on the flags register also make it harder
> to get any benefit from interleaving adds to two registers.
> 
> x86-64 uses 64bit 'add with carry' chains.
> No one ever noticed that they take two clocks each on
> Intel cpu until (about) Haswell.
> It is possible to get 12 bytes/clock with some strange
> loops that use (IIRC) adox and adcx.
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
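
The alternative raised in the quoted text, adding 32-bit values into 64-bit
registers so that no per-word carry handling is needed, could be sketched in
C roughly as follows. The name csum_words32() is hypothetical, and the
sketch assumes fewer than 2^32 input words so that the 64-bit accumulators
cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum 32-bit words into two independent 64-bit
 * accumulators. Carries simply collect in the upper 32 bits and are
 * folded back in once at the end.
 * Assumption: nwords < 2^32, so sum0 + sum1 cannot overflow 64 bits. */
uint32_t csum_words32(const uint32_t *mem, size_t nwords)
{
	uint64_t sum0 = 0, sum1 = 0, sum;
	size_t i = 0;

	for (; i + 2 <= nwords; i += 2) {
		sum0 += mem[i];		/* two independent add chains */
		sum1 += mem[i + 1];
	}
	if (i < nwords)
		sum0 += mem[i];

	sum = sum0 + sum1;
	/* Fold the upper 32 bits back in, twice to absorb the carry. */
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	return (uint32_t)sum;
}
```

Here each word costs one load and one add, which is where the
4 bytes/clock (one accumulator) and 8 bytes/clock (two accumulators)
estimates in the quoted text come from.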