lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 10 Feb 2023 11:21:33 +0800
From:   Huacai Chen <chenhuacai@...nel.org>
To:     David Laight <David.Laight@...lab.com>
Cc:     maobibo <maobibo@...ngson.cn>, WANG Xuerui <kernel@...0n.name>,
        Jiaxun Yang <jiaxun.yang@...goat.com>,
        "loongarch@...ts.linux.dev" <loongarch@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

This commit comes from the old internal kernel, I want to know which
one has better performance.

https://github.com/loongson/linux/commit/92a6df48ccb73dd2c3dc1799add08adf0e0b0deb

On Thu, Feb 9, 2023 at 8:39 PM David Laight <David.Laight@...lab.com> wrote:
>
> From: maobibo
> > Sent: 09 February 2023 11:55
> >
> >
> > 在 2023/2/9 17:35, David Laight 写道:
> > > From: Bibo Mao
> > >> Sent: 09 February 2023 03:59
> > >>
> > >> loongArch platform is 64-bit system, which supports 8 bytes memory
> > >> accessing, generic checksum function uses 4 byte memory access.
> > >> This patch adds 8-bytes memory access optimization for checksum
> > >> function on loongArch. And the code comes from arm64 system.
> > >
> > > How fast do these functions actually run (in bytes/clock)?
> > With uint128 method, there will unrolled loop, instruction
> > can execute in parallel. It gets the best result on loongarch
> > system where there is no neither carry flag nor post-index
> > addressing modes.
>
> We're probably almost agreeing...
>
> > Here is the piece of disassemble code with uint128 method:
>
> Load 8 values:
>
> >    120000a40:   28c0222f        ld.d    $r15,$r17,8(0x8)
> >    120000a44:   28c0622a        ld.d    $r10,$r17,24(0x18)
> >    120000a48:   28c0a230        ld.d    $r16,$r17,40(0x28)
> >    120000a4c:   28c0e232        ld.d    $r18,$r17,56(0x38)
> >    120000a50:   28c0022e        ld.d    $r14,$r17,0
> >    120000a54:   28c0422d        ld.d    $r13,$r17,16(0x10)
> >    120000a58:   28c0822b        ld.d    $r11,$r17,32(0x20)
> >    120000a5c:   28c0c22c        ld.d    $r12,$r17,48(0x30)
>
> Pairwise add them
>
> >    120000a60:   0010b9f7        add.d   $r23,$r15,$r14
> >    120000a64:   0010b54d        add.d   $r13,$r10,$r13
> >    120000a68:   0010b24c        add.d   $r12,$r18,$r12
> >    120000a6c:   0010ae0b        add.d   $r11,$r16,$r11
>
> Generate 4 'carry' bits
>
> >    120000a70:   0012c992        sltu    $r18,$r12,$r18
> >    120000a74:   0012beee        sltu    $r14,$r23,$r15
> >    120000a78:   0012c170        sltu    $r16,$r11,$r16
> >    120000a7c:   0012a9aa        sltu    $r10,$r13,$r10
>
> Add the carry bits onto the sums.
> I've not quite worked out which add is which!
> But I think you've missed a few adds here.
>
> >    120000a80:   0010ae0f        add.d   $r15,$r16,$r11
> >    120000a84:   0010ddce        add.d   $r14,$r14,$r23
> >    120000a88:   0010b250        add.d   $r16,$r18,$r12
> >    120000a8c:   0010b54d        add.d   $r13,$r10,$r13
> >    120000a90:   0010b5d2        add.d   $r18,$r14,$r13
> >    120000a94:   0010c1f0        add.d   $r16,$r15,$r16
>
> Somewhere each value needs an add, an sltu to generate the 'carry',
> and an add for the carry itself.
> If you sum the carry bits into a separate register it is
> possible to get a both adds and the sltu (for different values)
> to run in the same clock (on a suitable cpu).
> If there are 4 integer units you can also get the loop instructions
> 'for free' and unrolling 8 times may not be needed at all.
>
> ...
> > There is no post-index addressing modes on loongarch,
> >       val = *mem;  // 64bit read
> >         mem++;
> >       sum += val;
> >       carry = sum < val;
> >       carry_sum += carry;
> > it takes 5 instruction and these 5 instructions depends on previous instr.
>
> I'd assume the loop was unrolled enough so the address
> increment doesn't matter.
>
> > There is the piece of disassemble code:
> >    120000d90:   28c001f0        ld.d    $r16,$r15,0
> >    120000d94:   0010c58c        add.d   $r12,$r12,$r17
> >    120000d98:   02c021ef        addi.d  $r15,$r15,8(0x8)
>
> Those three instructions are independent.
>
> >    120000d9c:   0010b20c        add.d   $r12,$r16,$r12
>
> that one depends on the ld.d
>
> >    120000da0:   0012c191        sltu    $r17,$r12,$r16
>
> that depends on the add.d
> but it could be execute after the 'bne' in parallel with the ld.d
>
> >    120000da4:   5fffedf2        bne     $r15,$r18,-20(0x3ffec) # 120000d90 <do_csum_64+0x90>
>
> If you tweak the code it is possible to get down to just
> the addi.d and bne constraining the dependency chain.
> (Assuming there is no delay on the read and there are an infinite
> number of execution units.)
> Unroll once and do:
>         ld.d r,addr,0
>         addi.d addr,16
>         ld.d r,addr,-8
>         bne addr,limit,loop_top
> and you might get a loop that does a memory read every clock.
>
> So you end up worrying about how the memory read delays affect
> the instruction pipeline.
> The Intel x86 cpu I've got just pile up the arithmetic instructions
> waiting for the data to be read.
> If you get a memory read requested every clock everything else
> follows - provided you don't try to execute too many instrcutions
> at once.
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ