Date:	Fri, 26 Feb 2016 19:11:58 -0800
From:	Tom Herbert <tom@...bertland.com>
To:	Alexander Duyck <alexander.duyck@...il.com>
Cc:	David Miller <davem@...emloft.net>,
	Netdev <netdev@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, Peter Anvin <hpa@...or.com>,
	"the arch/x86 maintainers" <x86@...nel.org>,
	Kernel Team <kernel-team@...com>
Subject: Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck
<alexander.duyck@...il.com> wrote:
> On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert <tom@...bertland.com> wrote:
>> This patch implements a performant csum_partial for x86_64. The intent is
>> to speed up checksum calculation, particularly for smaller lengths such
>> as those present when doing skb_postpull_rcsum after getting
>> CHECKSUM_COMPLETE from a device or after a CHECKSUM_UNNECESSARY conversion.
>>
>> - v4
>>    - went back to C code with inline assembly for critical routines
>>    - implemented suggestion from Linus to deal with lengths < 8
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary length buffers filled with
>> random data. For each buffer I compared the checksum computed by the new
>> code against the original algorithm, for each possible alignment (0-7 bytes).
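>>
>> The check was along these lines (illustrative sketch only; MAX_LEN and the
>> helper names below are made up, this is not the actual test program):
>>
>>	unsigned char buf[MAX_LEN + 8];
>>	int align, len;
>>
>>	for (align = 0; align < 8; align++) {
>>		for (len = 0; len <= MAX_LEN; len++) {
>>			fill_random(buf + align, len);
>>			/* new implementation must match the original for
>>			 * every length/alignment combination
>>			 */
>>			if (csum_partial_new(buf + align, len, 0) !=
>>			    csum_partial_orig(buf + align, len, 0))
>>				report_mismatch(len, align);
>>		}
>>	}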
>>
>> Performance:
>>
>> Isolating old and new implementation for some common cases:
>>
>>     Len/Aln    Old      New     %
>>                nsecs    nsecs   Improv
>>     ---------+--------+--------+-------
>>     1400/0     195.6    181.7    7%    (Big packet)
>>     40/0        11.4      6.2   45%    (IPv6 hdr cmn case)
>>     8/4          7.9      3.2   59%    (UDP, VXLAN in IPv4)
>>     14/0         8.9      5.9   33%    (Eth hdr)
>>     14/4         9.2      5.9   35%    (Eth hdr in IPv4)
>>     14/3         9.6      5.9   38%    (Eth with odd align)
>>     20/0         9.0      6.2   31%    (IP hdr without options)
>>     7/1          8.9      4.2   52%    (buffer in one quad)
>>     100/0       17.4     13.9   20%    (medium-sized pkt)
>>     100/2       17.8     14.2   20%    (medium-sized pkt w/ alignment)
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>
>> Also tested on these with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>
>> Branch prediction:
>>
>> To test the effects of poor branch prediction in the jump tables I
>> measured checksum performance for two combinations of length and
>> alignment. As the baseline I performed half of the calls with the first
>> combination, followed by the second combination for the second half. In
>> the interleaved case, I alternate the two combinations so that the
>> length and alignment differ on every call, to defeat branch prediction.
>> Running several cases, I did not see any material performance difference
>> between the two scenarios (perf stat output is below), nor does either
>> case show a significant number of branch misses.
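>>
>> Schematically the two modes differ like this (illustrative only, not the
>> actual ./csum source; the variable names are made up):
>>
>>	if (interleave) {
>>		for (i = 0; i < count; i++) {
>>			csum_partial(buf_a, len_a, 0);	/* combination 1 */
>>			csum_partial(buf_b, len_b, 0);	/* combination 2 */
>>		}
>>	} else {
>>		for (i = 0; i < count; i++)
>>			csum_partial(buf_a, len_a, 0);	/* combination 1 */
>>		for (i = 0; i < count; i++)
>>			csum_partial(buf_b, len_b, 0);	/* combination 2 */
>>	}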
>>
>> Interleave lengths case:
>>
>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>>
>>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>
>>      9,556,693,202      instructions                                    ( +-  0.00% )
>>      1,176,208,640      branches                                        ( +-  0.00% )
>>             19,487      branch-misses    #    0.00% of all branches     ( +-  6.07% )
>>
>>        2.049732539 seconds time elapsed
>>
>> Non-interleave case:
>>
>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>      ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>>
>> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>
>>      9,782,188,310      instructions                                    ( +-  0.00% )
>>      1,251,286,958      branches                                        ( +-  0.01% )
>>             18,950      branch-misses    #    0.00% of all branches     ( +- 12.74% )
>>
>>        2.271789046 seconds time elapsed
>>
>> Signed-off-by: Tom Herbert <tom@...bertland.com>
>> ---
>>  arch/x86/include/asm/checksum_64.h |  21 ++++
>>  arch/x86/lib/csum-partial_64.c     | 225 ++++++++++++++++++++-----------------
>>  2 files changed, 143 insertions(+), 103 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
>> index cd00e17..e20c35b 100644
>> --- a/arch/x86/include/asm/checksum_64.h
>> +++ b/arch/x86/include/asm/checksum_64.h
>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>>         return a;
>>  }
>>
>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
>> +{
>> +       asm("addq %2,%0\n\t"
>> +           "adcq $0,%0"
>> +           : "=r" (a)
>> +           : "0" (a), "rm" (b));
>> +       return a;
>> +}
>> +
>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>> +                                            unsigned int c)
>> +{
>> +       asm("addl %2,%0\n\t"
>> +           "adcl %3,%0\n\t"
>> +           "adcl $0,%0"
>> +           : "=r" (a)
>> +           : "0" (a), "rm" (b), "rm" (c));
>> +
>> +       return a;
>> +}
>> +
>>  #define HAVE_ARCH_CSUM_ADD
>>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>>  {
>> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>> index 9845371..df82c9b 100644
>> --- a/arch/x86/lib/csum-partial_64.c
>> +++ b/arch/x86/lib/csum-partial_64.c
>> @@ -8,114 +8,66 @@
>>  #include <linux/compiler.h>
>>  #include <linux/module.h>
>>  #include <asm/checksum.h>
>> +#include <asm/word-at-a-time.h>
>>
>> -static inline unsigned short from32to16(unsigned a)
>> +static inline unsigned long rotate_by8_if_odd(unsigned long sum, bool aligned)
>>  {
>> -       unsigned short b = a >> 16;
>> -       asm("addw %w2,%w0\n\t"
>> -           "adcw $0,%w0\n"
>> -           : "=r" (b)
>> -           : "0" (b), "r" (a));
>> -       return b;
>> +       if (unlikely(aligned))
>> +               asm("rorq $0x8,%0"
>> +                       : "=r" (sum)
>> +                       : "0" (sum));
>> +       return sum;
>>  }
>
> Instead of making this conditional you might just want to perform the
> rotate always and just scale the aligned value up into a rotate count
> in bits.  That way you reduce the length of the dependency chain by 1,
> and rotates tend to only take a cycle on most x86_64 capable systems
> anyway.
>
I don't understand what you want. Can you provide more detail?

>>
>> -/*
>> - * Do a 64-bit checksum on an arbitrary memory area.
>> - * Returns a 32bit checksum.
>> - *
>> - * This isn't as time critical as it used to be because many NICs
>> - * do hardware checksumming these days.
>> - *
>> - * Things tried and found to not make it faster:
>> - * Manual Prefetching
>> - * Unrolling to an 128 bytes inner loop.
>> - * Using interleaving with more registers to break the carry chains.
>> - */
>> -static unsigned do_csum(const unsigned char *buff, unsigned len)
>> +/* Extract the "len" high order (nbo) bytes of quad "val". */
>> +static inline unsigned long csum_partial_lt8_head(unsigned long val, int len)
>>  {
>> -       unsigned odd, count;
>> -       unsigned long result = 0;
>> +       return val & ((1ul << len * 8) - 1);
>> +}
>>
>> -       if (unlikely(len == 0))
>> -               return result;
>> -       odd = 1 & (unsigned long) buff;
>> -       if (unlikely(odd)) {
>> -               result = *buff << 8;
>> -               len--;
>> -               buff++;
>> -       }
>> -       count = len >> 1;               /* nr of 16-bit words.. */
>> -       if (count) {
>> -               if (2 & (unsigned long) buff) {
>> -                       result += *(unsigned short *)buff;
>> -                       count--;
>> -                       len -= 2;
>> -                       buff += 2;
>> -               }
>> -               count >>= 1;            /* nr of 32-bit words.. */
>> -               if (count) {
>> -                       unsigned long zero;
>> -                       unsigned count64;
>> -                       if (4 & (unsigned long) buff) {
>> -                               result += *(unsigned int *) buff;
>> -                               count--;
>> -                               len -= 4;
>> -                               buff += 4;
>> -                       }
>> -                       count >>= 1;    /* nr of 64-bit words.. */
>> -
>> -                       /* main loop using 64byte blocks */
>> -                       zero = 0;
>> -                       count64 = count >> 3;
>> -                       while (count64) {
>> -                               asm("addq 0*8(%[src]),%[res]\n\t"
>> -                                   "adcq 1*8(%[src]),%[res]\n\t"
>> -                                   "adcq 2*8(%[src]),%[res]\n\t"
>> -                                   "adcq 3*8(%[src]),%[res]\n\t"
>> -                                   "adcq 4*8(%[src]),%[res]\n\t"
>> -                                   "adcq 5*8(%[src]),%[res]\n\t"
>> -                                   "adcq 6*8(%[src]),%[res]\n\t"
>> -                                   "adcq 7*8(%[src]),%[res]\n\t"
>> -                                   "adcq %[zero],%[res]"
>> -                                   : [res] "=r" (result)
>> -                                   : [src] "r" (buff), [zero] "r" (zero),
>> -                                   "[res]" (result));
>> -                               buff += 64;
>> -                               count64--;
>> -                       }
>> -
>> -                       /* last up to 7 8byte blocks */
>> -                       count %= 8;
>> -                       while (count) {
>> -                               asm("addq %1,%0\n\t"
>> -                                   "adcq %2,%0\n"
>> -                                           : "=r" (result)
>> -                                   : "m" (*(unsigned long *)buff),
>> -                                   "r" (zero),  "0" (result));
>> -                               --count;
>> -                                       buff += 8;
>> -                       }
>> -                       result = add32_with_carry(result>>32,
>> -                                                 result&0xffffffff);
>> -
>> -                       if (len & 4) {
>> -                               result += *(unsigned int *) buff;
>> -                               buff += 4;
>> -                       }
>> -               }
>> -               if (len & 2) {
>> -                       result += *(unsigned short *) buff;
>> -                       buff += 2;
>> -               }
>> -       }
>> -       if (len & 1)
>> -               result += *buff;
>> -       result = add32_with_carry(result>>32, result & 0xffffffff);
>> -       if (unlikely(odd)) {
>> -               result = from32to16(result);
>> -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>> +/* Extract the "len" low order (nbo) bytes of quad "val". */
>> +static inline unsigned long csum_partial_lt8_tail(unsigned long val, int len)
>> +{
>> +       return val >> (8 * (8 - len));
>> +}
>
> From what I can tell these functions only get used in one or two
> spots.  It may be worth it to just place the code directly in those
> spots rather than obscuring it behind a helper function.
>
>> +/* Sum over buffer up to 64 bytes. */
>> +static unsigned long csum_partial_le64(const void *buff, unsigned int len,
>> +                                      unsigned long sum)
>> +{
>> +       asm("lea 40f(, %[slen], 4), %%r11\n\t"
>> +           "clc\n\t"
>> +           "jmpq *%%r11\n\t"
>> +           "adcq 7*8(%[src]),%[res]\n\t"
>
> Technically this first adcq could just be an addq.  You don't have
> anything to carry anyway.
>
To what end?

>> +           "adcq 6*8(%[src]),%[res]\n\t"
>> +           "adcq 5*8(%[src]),%[res]\n\t"
>> +           "adcq 4*8(%[src]),%[res]\n\t"
>> +           "adcq 3*8(%[src]),%[res]\n\t"
>> +           "adcq 2*8(%[src]),%[res]\n\t"
>> +           "adcq 1*8(%[src]),%[res]\n\t"
>> +           "adcq 0*8(%[src]),%[res]\n\t"
>> +           "nop\n\t"
>> +           "40: adcq $0,%[res]"
>> +                   : [res] "=r" (sum)
>> +                   : [src] "r" (buff),
>> +                     [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
>> +                   : "r11");
>> +
>
> It would probably be useful to add a comment explaining this block.
> I had to compile it to verify my understanding of it is correct: you
> are using the label 40 as a jump target and simply subtracting the
> length in longs (scaled by the instruction size) from that jump
> target.  If nothing else you might want to rename the label from 40 to
> something else just to improve the readability.  It is possible that
> the code could get shifted around or inlined and at that point the 40
> becomes completely meaningless.
>
> Also it wouldn't hurt to explain the nop: the reason for the 4 in the
> lea instruction is that each adcq with a memory offset is 4 bytes long,
> and the last one has an offset of 0 (so it is a byte shorter), which is
> why we need the nop to keep everything 4 byte aligned from the end.
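>
> For example, something like this as the comment (the wording is just a
> sketch, adjust as you see fit):
>
>	/*
>	 * Jump into the middle of the adcq chain: each adcq with a memory
>	 * displacement assembles to 4 bytes, so "40f - 4 * (len >> 3)"
>	 * lands on the entry point from which the remaining adcq
>	 * instructions sum the first len/8 quads of the buffer.  The last
>	 * adcq has a zero displacement and is a byte shorter, so the
>	 * trailing nop pads it back out to keep every entry point 4 bytes
>	 * apart, measured back from the 40: label.
>	 */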
>
>> +       if (len & 0x7) {
>> +               unsigned long val;
>> +               /*
>> +                * The buffer always holds at least 8 valid bytes ending
>> +                * at "buff + len" here, so we backtrack to load the
>> +                * outstanding bytes into the low order bytes of a quad and
>> +                * sum those in csum_partial_lt8_tail. By doing this we
>> +                * avoid additional calls to load_unaligned_zeropad.
>> +                */
>> +
>> +               val = csum_partial_lt8_tail(*(unsigned long *)(buff + len - 8),
>> +                                           len & 0x7);
>> +               sum = add64_with_carry(val, sum);
>>         }
>> -       return result;
>> +
>> +       return sum;
>>  }
>>
>>  /*
>> @@ -130,10 +82,77 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>>   *
>>   * it's best to have buff aligned on a 64-bit boundary
>>   */
>> +static unsigned int do_csum(const void *buff, int len, unsigned int sum)
>> +{
>> +       bool aligned = false;
>> +       unsigned long result = 0;
>> +
>> +       /*
>> +        * For "small" lengths don't bother with alignment, x86 handles
>> +        * this pretty well.
>> +        */
>> +       if (len <= 8) {
>> +               unsigned long val;
>> +
>> +               /* Optimize trivial cases */
>> +               if (len == 8)
>> +                       return add32_with_carry3(sum,
>> +                                      *(unsigned int *)buff,
>> +                                      *(unsigned int *)(buff + 4));
>
> Doesn't this still run the risk of triggering a fault if the buffer
> crosses pages on a non-4 byte aligned boundary?
>
It's safe because we know the length of the buffer is eight bytes, so
both four byte loads fall entirely within the caller's buffer and cannot
run off the end of it onto an unmapped page. There is only one instance
where we need load_unaligned_zeropad to deal with potentially crossing
over to a bad page.

>> +               else if (len == 0)
>> +                       return sum;
>> +
>> +               val = load_unaligned_zeropad(buff);
>> +               result = csum_partial_lt8_head(val, len);
>> +
>> +               return add32_with_carry3(sum, result >> 32,
>> +                                        result & 0xffffffff);
>> +       } else if (len <= 64) {
>> +               result = csum_partial_le64(buff, len, result);
>> +
>> +               return add32_with_carry3(sum, result >> 32,
>> +                                        result & 0xffffffff);
>> +       }
>> +
>
> One side effect I see from the fact that you are calling
> csum_partial_le64 in two spots is that you end up having to make a
> call to it instead of having it be inlined directly into your do_csum
> function.  What you may want to do is alter your code flow so that if
> (len > 64) you do the bits below that occur before you call into
> csum_partial_le64; that way you will reduce the number of callers of
> csum_partial_le64 to one, which should get the compiler to inline it
> into your function.
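>
> Roughly this kind of structure, i.e. an untested skeleton just to
> illustrate the flow I have in mind, with the elided parts taken
> unchanged from your patch:
>
>	static unsigned int do_csum(const void *buff, int len, unsigned int sum)
>	{
>		bool aligned = false;
>		unsigned long result = 0;
>
>		if (len <= 8) {
>			/* len == 8, len == 0 and load_unaligned_zeropad
>			 * handling exactly as in your patch, returning early
>			 */
>		}
>
>		if (len > 64) {
>			/* alignment fixup plus the 64-byte adcq main loop
>			 * as in your patch, leaving len < 64 on exit
>			 */
>		}
>
>		/* now the only call site, so gcc should inline it */
>		result = csum_partial_le64(buff, len, result);
>		result = rotate_by8_if_odd(result, aligned);
>
>		return add32_with_carry3(sum, result >> 32, result & 0xffffffff);
>	}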
>
I see that gcc is inlining both calls to csum_partial_le64.

>> +       /*
>> +        * Length is greater than 64. Sum to eight byte alignment before
>> +        * proceeding with main loop.
>> +        */
>> +       aligned = !!((unsigned long)buff & 0x1);
>> +       if (aligned) {
>> +               unsigned int align = 7 & -(unsigned long)buff;
>> +
>> +               result = csum_partial_lt8_head(*(unsigned long *)buff, align);
>> +               buff += align;
>> +               len -= align;
>> +               result = rotate_by8_if_odd(result, align);
>> +       }
>
> I'm still not a fan of the unaligned reads.  They may be okay but it
> just seems like we are going to run into corner cases all over the
> place where this ends up biting us.
>
They are already all over the place in the stack. For instance, any time
we process an IP header within VXLAN we're likely doing several
unaligned reads to load IP addresses etc. These are a big problem for
architectures like Sparc, but do not appear to be an issue for x86.

>> +       /* main loop using 64byte blocks */
>> +       for (; len >= 64; len -= 64, buff += 64) {
>> +               asm("addq 0*8(%[src]),%[res]\n\t"
>> +                   "adcq 1*8(%[src]),%[res]\n\t"
>> +                   "adcq 2*8(%[src]),%[res]\n\t"
>> +                   "adcq 3*8(%[src]),%[res]\n\t"
>> +                   "adcq 4*8(%[src]),%[res]\n\t"
>> +                   "adcq 5*8(%[src]),%[res]\n\t"
>> +                   "adcq 6*8(%[src]),%[res]\n\t"
>> +                   "adcq 7*8(%[src]),%[res]\n\t"
>> +                   "adcq $0,%[res]"
>> +                   : [res] "=r" (result)
>> +                   : [src] "r" (buff),
>> +                   "[res]" (result));
>> +       }
>> +
>> +       result = csum_partial_le64(buff, len, result);
>> +       result = rotate_by8_if_odd(result, aligned);
>> +
>> +       return add32_with_carry3(sum, result >> 32, result & 0xffffffff);
>> +}
>> +
>>  __wsum csum_partial(const void *buff, int len, __wsum sum)
>>  {
>> -       return (__force __wsum)add32_with_carry(do_csum(buff, len),
>> -                                               (__force u32)sum);
>> +       return (__force __wsum)do_csum(buff, len, (__force u32)sum);
>>  }
>>
>>  /*
>> --
>> 2.6.5
>>
