linux-kernel - Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ec49593a-60a4-be91-0fb2-af517eaf6d6a@loongson.cn>
Date:   Tue, 14 Feb 2023 09:31:07 +0800
From:   maobibo <maobibo@...ngson.cn>
To:     David Laight <David.Laight@...LAB.COM>,
        Huacai Chen <chenhuacai@...nel.org>
Cc:     WANG Xuerui <kernel@...0n.name>,
        Jiaxun Yang <jiaxun.yang@...goat.com>,
        "loongarch@...ts.linux.dev" <loongarch@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] LoongArch: add checksum optimization for 64-bit system



在 2023/2/10 19:08, David Laight 写道:
> From: maobibo
>> Sent: 10 February 2023 10:06
>>
>> With the test cases
>>   https://github.com/bibo-mao/bench/tree/master/csum
>>
>> Tested with different buffer size 4096/1472/250/40, here is the output on my
>> loongarch machine. Loops times is 0x100000, and time cost unit is milliseconds,
>> and the smaller value will be better.
>>
>>
>> buf size[4096] loops[0x100000] times[us]: csum uint128 344473 asm method 373391 uint64 741412
>> buf size[1472] loops[0x100000] times[us]: csum uint128 131849 asm method 138533 uint64 271317
>> buf size[ 250] loops[0x100000] times[us]: csum uint128 34512 asm method 36294 uint64 51576
>> buf size[  40] loops[0x100000] times[us]: csum uint128 12182 asm method 23874 uint64 15769
> 
> What do those work out as in bytes/clock?
> 
> Rather than run 1000s of iterations (and be hit by interrupts etc)
> I sometimes just use an accurate cycle counter and measure the
> time for a single buffer (or varying length).
> Save and print the values of 10 calls and you'll get pretty
> consistent values after the first couple (cold cache).
> Then you can work out how long each iteration of the main loop costs.
> 
> I think you have to execute 4 instructions for each 64bit word.
> One memory read, the main add, a setle and the add of the carry.
> 
> For a simple cpu that is always going to be 4 clocks.
> But if there are delay slots after the memory read you can
> fill them with the alu instructions for an earlier read.
> You also need an add and bne for the address for each iteration.
> So unrolling the loop further will help.
> 
> OTOH if your cpu can execute multiple instructions in one clock
> you can expect to do a lot better.
> With 3 ALU instructions (and one read) you should be able to
> find a code sequence that will run at 8 bytes/clock.
> With 4 ALU it is likely that the loop instructions can also
> execute in parallel - so you don't need massive loop unrolling.
> 
> Unless the cpu is massively 'out of order' (like some x86)
> I'd expect the best code to interleave the reads and alu
> operations for earlier values - rather than having all
> the reads at the top of the loop.
> So the loop would be a repeating pattern of instructions
> with some values being carried between iterations.
Part of asm code depends on previous intr in website
https://github.com/loongson/linux/commit/92a6df48ccb73dd2c3dc1799add08adf0e0b0deb,
such as macro ADDC
#define ADDC(sum,reg)                                           \
        ADD     sum, sum, reg;                                  \
        sltu    t8, sum, reg;                                   \
        ADD     sum, sum, t8;                                   \
these three instructions depends on each other, and can not execute
in parallel. 

The original of main loop about Lmove_128bytes is:
#define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3)     \
        LOAD    _t0, src, (offset + UNIT(0));                   \
        LOAD    _t1, src, (offset + UNIT(1));                   \
        LOAD    _t2, src, (offset + UNIT(2));                   \
        LOAD    _t3, src, (offset + UNIT(3));                   \
        ADDC(_t0, _t1);                                         \
        ADDC(_t2, _t3);                                         \
        ADDC(sum, _t0);                                         \
        ADDC(sum, _t2)

.Lmove_128bytes:
        CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
        CSUM_BIGCHUNK(src, 0x20, sum, t0, t1, t3, t4)
        CSUM_BIGCHUNK(src, 0x40, sum, t0, t1, t3, t4)
        CSUM_BIGCHUNK(src, 0x60, sum, t0, t1, t3, t4)
        addi.d  t5, t5, -1
        addi.d  src, src, 0x80
        bnez    t5, .Lmove_128bytes

I modified the main loop with label .Lmove_128bytes to reduce
dependency between instructions like this, it can improve the
performance.
can improve the performance.
.Lmove_128bytes:
        LOAD    t0, src, 0
        LOAD    t1, src, 8
        LOAD    t3, src, 16
        LOAD    t4, src, 24
        LOAD    a3, src, 0 + 0x20
        LOAD    a4, src, 8 + 0x20
        LOAD    a5, src, 16 + 0x20
        LOAD    a6, src, 24 + 0x20
        ADD     t0, t0,  t1
        ADD     t3, t3,  t4
        ADD     a3, a3,  a4
        ADD     a5, a5,  a6
        sltu    t8, t0,  t1
        sltu    a7, t3,  t4
        ADD     t0, t0,  t8
        ADD     t3, t3,  a7
        sltu    t1, a3,  a4
        sltu    t4, a5,  a6
        ADD     a3, a3,  t1
        ADD     a5, a5,  t4
        ADD     t0, t0, t3
        ADD     a3, a3, a5
        sltu    t1, t0, t3
        sltu    t4, a3, a5
        ADD     t0, t0, t1
        ADD     a3, a3, t4
        ADD     sum, sum, t0
        sltu    t8,  sum, t0
        ADD     sum, sum,  t8
        ADD     sum, sum, a3
        sltu    t8,  sum, a3
        addi.d  t5, t5, -1
        ADD     sum, sum, t8

However the result and principle is almost the similar with
uint128 c code. And there is no performance impact interleaving
the reads and alu operations.

Regards
Bibo, Mao

> 
> I doubt you'll get a loop to execute every clock, but
> a two clock loop is entirely possible.
> It rather depends how fast the instruction decoder
> handles the (predicted) branch.
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)