netdev - Re: [net-next PATCH] csum: Update csum_block_add to use rotate instead of byteswap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CALx6S34WoWcK05J52q2z242Nyk7N1q9YjJDw7HsZ4s3VKWoonA@mail.gmail.com>
Date:	Tue, 8 Mar 2016 22:31:26 -0800
From:	Tom Herbert <tom@...bertland.com>
To:	Alexander Duyck <alexander.duyck@...il.com>
Cc:	Joe Perches <joe@...ches.com>,
	Alexander Duyck <aduyck@...antis.com>,
	Netdev <netdev@...r.kernel.org>,
	David Miller <davem@...emloft.net>
Subject: Re: [net-next PATCH] csum: Update csum_block_add to use rotate
 instead of byteswap

On Tue, Mar 8, 2016 at 10:08 PM, Alexander Duyck
<alexander.duyck@...il.com> wrote:
> On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <joe@...ches.com> wrote:
>> On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote:
>>> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <joe@...ches.com> wrote:
>>> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote:
>>> > > The code for csum_block_add was doing a funky byteswap to swap the even and
>>> > > odd bytes of the checksum if the offset was odd.  Instead of doing this we
>>> > > can save ourselves some trouble and just shift by 8 as this should have the
>>> > > same effect in terms of the final checksum value and only requires one
>>> > > instruction.
>>> > 3 instructions?
>>> I was talking about just the one ror vs mov, shl, shr, and ,and, add.
>>>
>>> I assume when you say 3 you are including the test and either some
>>> form of conditional move or jump?
>>
>> Yeah, instruction count also depends on architecture (arm/x86/ppc...)
>
> Right.  But the general idea is that rotate is an instruction most
> architectures have.  I haven't heard of an instruction that swaps even
> and odd bytes of a 32 bit word.
>
Yes, I took a look inlining these.

#define rol32(V, X) ({                          \
        int word = V;                           \
        if (__builtin_constant_p(X))            \
                asm("roll $" #X ",%[word]\n\t"  \
                    : [word] "=r" (word));      \
        else                                    \
                asm("roll %%cl,%[word]\n\t"     \
                    : [word] "=r" (word)        \
                    : "c" (X));                 \
        word;                                   \
})

With this I'm seeing a nice speedup in jhash which uses a lot of rol32s...

>>> > > diff --git a/include/net/checksum.h b/include/net/checksum.h
>> []
>>> > > @@ -88,8 +88,10 @@ static inline __wsum
>>> > >  csum_block_add(__wsum csum, __wsum csum2, int offset)
>>> > >  {
>>> > >       u32 sum = (__force u32)csum2;
>>> > > -     if (offset&1)
>>> > > -             sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF);
>>> > > +
>>> > > +     if (offset & 1)
>>> > > +             sum = (sum << 24) + (sum >> 8);
>>> > Maybe use ror32(sum, 8);
>>> I was actually thinking I could use something like this.  I didn't
>>> realize it was even available.
>>
>> Now you know: bitops.h
>>
>>> > or maybe something like:
>>> >
>>> > {
>>> >         u32 sum;
>>> >
>>> >         /* rotated csum2 of odd offset will be the right checksum */
>>> >         if (offset & 1)
>>> >                 sum = ror32((__force u32)csum2, 8);
>>> >         else
>>> >                 sum = (__force u32)csum2;
>>> >
>>> Any specific reason for breaking it up like this?  It seems like it
>>> was easier to just have sum be assigned first and then rotating it if
>>> needed.  What is gained by splitting the assignment up over two
>>> different calls?
>>
>> It's only for reader clarity where a comment could be useful.
>> The compiler output shouldn't change.
>
> Okay, well I can add a one line comment about aligning to a 16b
> boundary for clarity.
>
> - Alex