Message-ID: <56D43F33.4030702@c-s.fr>
Date: Mon, 29 Feb 2016 13:53:07 +0100
From: Christophe Leroy <christophe.leroy@....fr>
To: Scott Wood <scottwood@...escale.com>
Cc: Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Paul Mackerras <paulus@...ba.org>,
Michael Ellerman <mpe@...erman.id.au>,
linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
netdev@...r.kernel.org
Subject: Re: [PATCH 6/9] powerpc32: optimise a few instructions in csum_partial()
On 23/10/2015 05:30, Scott Wood wrote:
> On Tue, 2015-09-22 at 16:34 +0200, Christophe Leroy wrote:
>> r5 already contains the value to be updated, so let's use r5 throughout
>> for that. It makes the code more readable.
>>
>> To avoid confusion, it is better to use adde instead of addc
>>
>> The first addition is useless. Its only purpose is to clear carry.
>> As r4 is a signed int that is always positive, this can be done by
>> using srawi instead of srwi
>>
>> Let's also remove the comment about bdnz having no overhead, as it
>> is not correct on all PowerPC cores, at least not on the MPC8xx.
>>
>> In the last part, in our situation, the remaining number of bytes
>> to be processed is between 0 and 3. Therefore, we can base that part
>> on the value of bit 31 and bit 30 of r4 instead of anding r4 with 3
>> then proceeding with comparisons and subtractions.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@....fr>
>> ---
>> arch/powerpc/lib/checksum_32.S | 37 +++++++++++++++++--------------------
>> 1 file changed, 17 insertions(+), 20 deletions(-)
> Do you have benchmarks for these optimizations?
>
> -Scott
Using mftbl() to read the timebase just before and just after the call to
csum_partial(), I get the following on an MPC885:
* 78-byte packets: 9% faster (11.5 to 10.4 tb ticks)
* 328-byte packets: 3% faster (47.9 to 46.5 tb ticks)
Christophe
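
As a quick check of the percentages above, the relative saving is
(before - after) / before. A minimal C sketch of that arithmetic follows.
The helper names percent_faster() and report() are made up for
illustration; the tick counts are the ones reported above.

```c
#include <stdio.h>

/* Hypothetical helper: relative reduction in timebase ticks, in percent. */
static double percent_faster(double ticks_before, double ticks_after)
{
    return (ticks_before - ticks_after) / ticks_before * 100.0;
}

/* Hypothetical helper: print the saving for one packet size. */
static void report(unsigned int packet_bytes, double before, double after)
{
    printf("%u-byte packets: %.1f%% faster\n",
           packet_bytes, percent_faster(before, after));
}
```

Calling report(78, 11.5, 10.4) and report(328, 47.9, 46.5) gives roughly
9.6% and 2.9%, in line with the rounded figures quoted above.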
>
>> diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
>> index 3472372..9c12602 100644
>> --- a/arch/powerpc/lib/checksum_32.S
>> +++ b/arch/powerpc/lib/checksum_32.S
>> @@ -27,35 +27,32 @@
>> * csum_partial(buff, len, sum)
>> */
>> _GLOBAL(csum_partial)
>> - addic r0,r5,0
>> subi r3,r3,4
>> - srwi. r6,r4,2
>> + srawi. r6,r4,2 /* Divide len by 4 and also clear carry */
>> beq 3f /* if we're doing < 4 bytes */
>> - andi. r5,r3,2 /* Align buffer to longword boundary */
>> + andi. r0,r3,2 /* Align buffer to longword boundary */
>> beq+ 1f
>> - lhz r5,4(r3) /* do 2 bytes to get aligned */
>> - addi r3,r3,2
>> + lhz r0,4(r3) /* do 2 bytes to get aligned */
>> subi r4,r4,2
>> - addc r0,r0,r5
>> + addi r3,r3,2
>> srwi. r6,r4,2 /* # words to do */
>> + adde r5,r5,r0
>> beq 3f
>> 1: mtctr r6
>> -2: lwzu r5,4(r3) /* the bdnz has zero overhead, so it should */
>> - adde r0,r0,r5 /* be unnecessary to unroll this loop */
>> +2: lwzu r0,4(r3)
>> + adde r5,r5,r0
>> bdnz 2b
>> - andi. r4,r4,3
>> -3: cmpwi 0,r4,2
>> - blt+ 4f
>> - lhz r5,4(r3)
>> +3: andi. r0,r4,2
>> + beq+ 4f
>> + lhz r0,4(r3)
>> addi r3,r3,2
>> - subi r4,r4,2
>> - adde r0,r0,r5
>> -4: cmpwi 0,r4,1
>> - bne+ 5f
>> - lbz r5,4(r3)
>> - slwi r5,r5,8 /* Upper byte of word */
>> - adde r0,r0,r5
>> -5: addze r3,r0 /* add in final carry */
>> + adde r5,r5,r0
>> +4: andi. r0,r4,1
>> + beq+ 5f
>> + lbz r0,4(r3)
>> + slwi r0,r0,8 /* Upper byte of word */
>> + adde r5,r5,r0
>> +5: addze r3,r5 /* add in final carry */
>> blr
>>
>> /*
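
The tail handling changed by the patch can be modelled in C to check that
testing the two low bits of len gives the same result as the old
mask/compare/subtract sequence. This is only an illustrative sketch, not
the kernel code: all function names are invented here, bytes are read
explicitly in big-endian order, the initial halfword alignment step is
left out, and carries are folded back after every addition instead of
being chained with adde, so the 32-bit intermediate value need not match
the assembly bit for bit.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit one's-complement addition: the carry out is added back in. */
static uint32_t add_carry32(uint32_t sum, uint32_t value)
{
    uint64_t wide = (uint64_t)sum + value;
    return (uint32_t)wide + (uint32_t)(wide >> 32);
}

/* Sum every complete big-endian 32-bit word in buf. */
static uint32_t sum_words(const uint8_t *buf, size_t len, uint32_t sum)
{
    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint32_t word = ((uint32_t)buf[i] << 24) | ((uint32_t)buf[i + 1] << 16) |
                        ((uint32_t)buf[i + 2] << 8) | buf[i + 3];
        sum = add_carry32(sum, word);
    }
    return sum;
}

/* Tail as before the patch: mask with 3, then compare and subtract. */
static uint32_t tail_compare(const uint8_t *tail, size_t len, uint32_t sum)
{
    size_t rem = len & 3;
    if (rem >= 2) {
        sum = add_carry32(sum, ((uint32_t)tail[0] << 8) | tail[1]);
        tail += 2;
        rem -= 2;
    }
    if (rem == 1)
        sum = add_carry32(sum, (uint32_t)tail[0] << 8); /* upper byte of halfword */
    return sum;
}

/* Tail as after the patch: test the two low bits of len directly. */
static uint32_t tail_bits(const uint8_t *tail, size_t len, uint32_t sum)
{
    if (len & 2) {
        sum = add_carry32(sum, ((uint32_t)tail[0] << 8) | tail[1]);
        tail += 2;
    }
    if (len & 1)
        sum = add_carry32(sum, (uint32_t)tail[0] << 8); /* upper byte of halfword */
    return sum;
}

/* Model of csum_partial(buff, len, sum); use_bits selects the tail variant. */
static uint32_t csum_model(const uint8_t *buf, size_t len, uint32_t sum, int use_bits)
{
    const uint8_t *tail = buf + (len & ~(size_t)3);
    sum = sum_words(buf, len, sum);
    return use_bits ? tail_bits(tail, len, sum) : tail_compare(tail, len, sum);
}

/* Fold a 32-bit partial sum down to 16 bits. */
static uint16_t fold16(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

Both tail variants only ever look at len modulo 4, so they agree for every
length; the bit-test version simply avoids modifying r4 and the extra
compare instructions.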