Message-ID: <CAKgT0UebO62GBsmL17JrZW0Ptzmr05buc1x6pHv6A_PAr4HBLQ@mail.gmail.com>
Date: Wed, 9 Mar 2016 08:08:18 -0800
From: Alexander Duyck <alexander.duyck@...il.com>
To: Tom Herbert <tom@...bertland.com>
Cc: Joe Perches <joe@...ches.com>,
Alexander Duyck <aduyck@...antis.com>,
Netdev <netdev@...r.kernel.org>,
David Miller <davem@...emloft.net>
Subject: Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap
On Tue, Mar 8, 2016 at 10:31 PM, Tom Herbert <tom@...bertland.com> wrote:
> On Tue, Mar 8, 2016 at 10:08 PM, Alexander Duyck
> <alexander.duyck@...il.com> wrote:
>> On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@...ches.com> wrote:
>>> On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote:
>>>> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@...ches.com> wrote:
>>>> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote:
>>>> > > The code for csum_block_add was doing a funky byteswap to swap the even and
>>>> > > odd bytes of the checksum if the offset was odd. Instead of doing this we
>>>> > > can save ourselves some trouble and just shift by 8 as this should have the
>>>> > > same effect in terms of the final checksum value and only requires one
>>>> > > instruction.
>>>> > 3 instructions?
>>>> I was talking about just the one ror vs mov, shl, shr, and, and, add.
>>>>
>>>> I assume when you say 3 you are including the test and either some
>>>> form of conditional move or jump?
>>>
>>> Yeah, instruction count also depends on architecture (arm/x86/ppc...)
>>
>> Right. But the general idea is that rotate is an instruction most
>> architectures have. I haven't heard of an instruction that swaps even
>> and odd bytes of a 32 bit word.
>>
> Yes, I took a look inlining these.
>
> #define rol32(V, X) ({                                  \
>         int word = V;                                   \
>         /* "+r": word is both read and written */       \
>         if (__builtin_constant_p(X))                    \
>                 asm("roll $" #X ",%[word]\n\t"          \
>                     : [word] "+r" (word));              \
>         else                                            \
>                 asm("roll %%cl,%[word]\n\t"             \
>                     : [word] "+r" (word)                \
>                     : "c" (X));                         \
>         word;                                           \
> })
>
> With this I'm seeing a nice speedup in jhash which uses a lot of rol32s...
Is gcc really not converting the rol32 calls into rotates?
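
For comparison, here is a minimal sketch of a rotate written in plain C with no inline asm. The name my_rol32 is hypothetical and this is not a copy of the helper in bitops.h; it only illustrates the shift-and-or pattern that gcc and clang generally recognize at -O2 and turn into a single rotate. Checking the disassembly is the only way to be sure for a given compiler version and architecture.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's rol32): rotate a 32-bit word
 * left by 'shift' bits. Masking both shift counts with 31 keeps every
 * shift in the range 0..31, so the expression stays well defined even
 * when shift is 0 or 32.
 */
static inline uint32_t my_rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}
```

If the compiler does emit a rotate for this pattern, the inline asm version should not be needed for correctness or speed on that target.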
If we need this type of code in order to get the rotates to occur as
expected, then maybe we need to look at adding arch-specific versions of
the functions in bitops.h to improve performance. I know these calls are
used in some performance-critical paths such as crypto and hashing.
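
Coming back to the claim in the original patch description, that rotating by 8 has the same effect on the final checksum as swapping even and odd bytes: the following is a small sketch for checking it in user space. The helper names fold32, swap_even_odd and ror8 are made up for this example and are not the kernel functions. The reasoning is that in ones' complement arithmetic 2^16 is congruent to 1 modulo 0xffff, so both operations move each byte to a position with the same weight once the 32-bit sum is folded to 16 bits.

```c
#include <stdint.h>

/* Fold a 32-bit ones' complement partial sum down to 16 bits. */
static uint16_t fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);	/* absorb the carry */
	return (uint16_t)sum;
}

/* Swap the even and odd bytes of a 32-bit value (byteswap approach). */
static uint32_t swap_even_odd(uint32_t sum)
{
	return ((sum & 0x00ff00ffu) << 8) | ((sum >> 8) & 0x00ff00ffu);
}

/* Rotate a 32-bit value right by 8 bits (rotate approach). */
static uint32_t ror8(uint32_t sum)
{
	return (sum >> 8) | (sum << 24);
}
```

For example, with sum = 0x12345678 both fold32(swap_even_odd(sum)) and fold32(ror8(sum)) come out as 0xac68, even though the intermediate 32-bit values differ.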
- Alex