Message-ID: <CAKgT0Uf9t9V65giwiwGAYbrd8SA9h8ULQj-4u4Ju=s5UaETFPQ@mail.gmail.com>
Date: Fri, 26 Feb 2016 19:40:50 -0800
From: Alexander Duyck <alexander.duyck@...il.com>
To: Tom Herbert <tom@...bertland.com>
Cc: David Miller <davem@...emloft.net>,
Netdev <netdev@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Peter Anvin <hpa@...or.com>,
"the arch/x86 maintainers" <x86@...nel.org>,
Kernel Team <kernel-team@...com>
Subject: Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64
On Fri, Feb 26, 2016 at 7:11 PM, Tom Herbert <tom@...bertland.com> wrote:
> On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck
> <alexander.duyck@...il.com> wrote:
>> On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert <tom@...bertland.com> wrote:
>>> This patch implements performant csum_partial for x86_64. The intent is
>>> to speed up checksum calculation, particularly for smaller lengths such
>>> as those that are present when doing skb_postpull_rcsum when getting
>>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>>
>>> - v4
>>> - went back to C code with inline assembly for critical routines
>>> - implemented suggestion from Linus to deal with lengths < 8
>>>
>>> Testing:
>>>
>>> Correctness:
>>>
>>> Verified correctness by testing arbitrary length buffer filled with
>>> random data. For each buffer I compared the computed checksum
>>> using the original algorithm for each possible alignment (0-7 bytes).
>>>
>>> Performance:
>>>
>>> Isolating old and new implementation for some common cases:
>>>
>>>              Old      New      %
>>>  Len/Aln    nsecs    nsecs   Improv
>>> --------+-------+--------+-------
>>>   1400/0    195.6    181.7      7%   (Big packet)
>>>     40/0     11.4      6.2     45%   (Ipv6 hdr cmn case)
>>>      8/4      7.9      3.2     59%   (UDP, VXLAN in IPv4)
>>>     14/0      8.9      5.9     33%   (Eth hdr)
>>>     14/4      9.2      5.9     35%   (Eth hdr in IPv4)
>>>     14/3      9.6      5.9     38%   (Eth with odd align)
>>>     20/0      9.0      6.2     31%   (IP hdr without options)
>>>      7/1      8.9      4.2     52%   (buffer in one quad)
>>>    100/0     17.4     13.9     20%   (medium-sized pkt)
>>>    100/2     17.8     14.2     20%   (medium-sized pkt w/ alignment)
>>>
>>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>>
>>> Also tested on these with similar results:
>>>
>>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>>
>>> Branch prediction:
>>>
>>> To test the effects of poor branch prediction in the jump tables I
>>> tested checksum performance with runs for two combinations of length
>>> and alignment. As the baseline I performed the test by doing half of
>>> calls with the first combination, followed by using the second
>>> combination for the second half. In the test case, I interleave the
>>> two combinations so that in every call the length and alignment are
>>> different to defeat the effects of branch prediction. Running several
>>> cases, I did not see any material performance difference between the
>>> two scenarios (perf stat output is below), neither does either case
>>> show a significant number of branch misses.
>>>
>>> Interleave lengths case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>> ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>>>
>>> Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>>
>>> 9,556,693,202 instructions ( +- 0.00% )
>>> 1,176,208,640 branches ( +- 0.00% )
>>> 19,487 branch-misses # 0.00% of all branches ( +- 6.07% )
>>>
>>> 2.049732539 seconds time elapsed
>>>
>>> Non-interleave case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>> ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>>>
>>> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>>
>>> 9,782,188,310 instructions ( +- 0.00% )
>>> 1,251,286,958 branches ( +- 0.01% )
>>> 18,950 branch-misses # 0.00% of all branches ( +- 12.74% )
>>>
>>> 2.271789046 seconds time elapsed
>>>
>>> Signed-off-by: Tom Herbert <tom@...bertland.com>
>>> ---
>>> arch/x86/include/asm/checksum_64.h | 21 ++++
>>> arch/x86/lib/csum-partial_64.c | 225 ++++++++++++++++++++-----------------
>>> 2 files changed, 143 insertions(+), 103 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
>>> index cd00e17..e20c35b 100644
>>> --- a/arch/x86/include/asm/checksum_64.h
>>> +++ b/arch/x86/include/asm/checksum_64.h
>>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>>> return a;
>>> }
>>>
>>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
>>> +{
>>> + asm("addq %2,%0\n\t"
>>> + "adcq $0,%0"
>>> + : "=r" (a)
>>> + : "0" (a), "rm" (b));
>>> + return a;
>>> +}
>>> +
>>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>>> + unsigned int c)
>>> +{
>>> + asm("addl %2,%0\n\t"
>>> + "adcl %3,%0\n\t"
>>> + "adcl $0,%0"
>>> + : "=r" (a)
>>> + : "" (a), "rm" (b), "rm" (c));
>>> +
>>> + return a;
>>> +}
>>> +
>>> #define HAVE_ARCH_CSUM_ADD
>>> static inline __wsum csum_add(__wsum csum, __wsum addend)
>>> {
>>> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>>> index 9845371..df82c9b 100644
>>> --- a/arch/x86/lib/csum-partial_64.c
>>> +++ b/arch/x86/lib/csum-partial_64.c
>>> @@ -8,114 +8,66 @@
>>> #include <linux/compiler.h>
>>> #include <linux/module.h>
>>> #include <asm/checksum.h>
>>> +#include <asm/word-at-a-time.h>
>>>
>>> -static inline unsigned short from32to16(unsigned a)
>>> +static inline unsigned long rotate_by8_if_odd(unsigned long sum, bool aligned)
>>> {
>>> - unsigned short b = a >> 16;
>>> - asm("addw %w2,%w0\n\t"
>>> - "adcw $0,%w0\n"
>>> - : "=r" (b)
>>> - : "0" (b), "r" (a));
>>> - return b;
>>> + if (unlikely(aligned))
>>> + asm("rorq $0x8,%0"
>>> + : "=r" (sum)
>>> + : "0" (sum));
>>> + return sum;
>>> }
>>
>> Instead of making this conditional you might just want to perform
>> the rotate always and just shift the aligned value up by 3 bits. That
>> way you reduce the length of the dependency chain by 1, and rotates
>> tend to only take a cycle on most x86_64 capable systems anyway.
>>
> I don't understand what you want. Can you provide more detail?
So right now you have aligned as a boolean value, and the code path
ends up being something like cmp, jmp, rorq. If you converted it to an
integer value you could simply shift aligned left by 3 and perform the
rotate unconditionally, since a rotate by 0 has no effect. That drops
one uop from the path.
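Something along these lines is what I had in mind (just a sketch; the
second parameter is now the low alignment bit rather than a bool, and
the count is forced into %cl so the compare and branch go away
entirely):

static inline unsigned long rotate_by8_if_odd(unsigned long sum,
					      unsigned int odd)
{
	/* "odd" is 0 or 1; a rotate count of 0 is a no-op, so no
	 * branch is needed and only the rorq stays in the path.
	 */
	asm("rorq %b[cnt],%[s]"
	    : [s] "+r" (sum)
	    : [cnt] "c" (odd << 3));
	return sum;
}

Burning %cl for the count is the only real cost, and that is cheaper
than the cmp/jmp pair it replaces.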
>>>
>>> -/*
>>> - * Do a 64-bit checksum on an arbitrary memory area.
>>> - * Returns a 32bit checksum.
>>> - *
>>> - * This isn't as time critical as it used to be because many NICs
>>> - * do hardware checksumming these days.
>>> - *
>>> - * Things tried and found to not make it faster:
>>> - * Manual Prefetching
>>> - * Unrolling to an 128 bytes inner loop.
>>> - * Using interleaving with more registers to break the carry chains.
>>> - */
>>> -static unsigned do_csum(const unsigned char *buff, unsigned len)
>>> +/* Extract eight high order (nbo) bytes of quad "val". */
>>> +static inline unsigned long csum_partial_lt8_head(unsigned long val, int len)
>>> {
>>> - unsigned odd, count;
>>> - unsigned long result = 0;
>>> + return val & ((1ul << len * 8) - 1);
>>> +}
>>>
>>> - if (unlikely(len == 0))
>>> - return result;
>>> - odd = 1 & (unsigned long) buff;
>>> - if (unlikely(odd)) {
>>> - result = *buff << 8;
>>> - len--;
>>> - buff++;
>>> - }
>>> - count = len >> 1; /* nr of 16-bit words.. */
>>> - if (count) {
>>> - if (2 & (unsigned long) buff) {
>>> - result += *(unsigned short *)buff;
>>> - count--;
>>> - len -= 2;
>>> - buff += 2;
>>> - }
>>> - count >>= 1; /* nr of 32-bit words.. */
>>> - if (count) {
>>> - unsigned long zero;
>>> - unsigned count64;
>>> - if (4 & (unsigned long) buff) {
>>> - result += *(unsigned int *) buff;
>>> - count--;
>>> - len -= 4;
>>> - buff += 4;
>>> - }
>>> - count >>= 1; /* nr of 64-bit words.. */
>>> -
>>> - /* main loop using 64byte blocks */
>>> - zero = 0;
>>> - count64 = count >> 3;
>>> - while (count64) {
>>> - asm("addq 0*8(%[src]),%[res]\n\t"
>>> - "adcq 1*8(%[src]),%[res]\n\t"
>>> - "adcq 2*8(%[src]),%[res]\n\t"
>>> - "adcq 3*8(%[src]),%[res]\n\t"
>>> - "adcq 4*8(%[src]),%[res]\n\t"
>>> - "adcq 5*8(%[src]),%[res]\n\t"
>>> - "adcq 6*8(%[src]),%[res]\n\t"
>>> - "adcq 7*8(%[src]),%[res]\n\t"
>>> - "adcq %[zero],%[res]"
>>> - : [res] "=r" (result)
>>> - : [src] "r" (buff), [zero] "r" (zero),
>>> - "[res]" (result));
>>> - buff += 64;
>>> - count64--;
>>> - }
>>> -
>>> - /* last up to 7 8byte blocks */
>>> - count %= 8;
>>> - while (count) {
>>> - asm("addq %1,%0\n\t"
>>> - "adcq %2,%0\n"
>>> - : "=r" (result)
>>> - : "m" (*(unsigned long *)buff),
>>> - "r" (zero), "0" (result));
>>> - --count;
>>> - buff += 8;
>>> - }
>>> - result = add32_with_carry(result>>32,
>>> - result&0xffffffff);
>>> -
>>> - if (len & 4) {
>>> - result += *(unsigned int *) buff;
>>> - buff += 4;
>>> - }
>>> - }
>>> - if (len & 2) {
>>> - result += *(unsigned short *) buff;
>>> - buff += 2;
>>> - }
>>> - }
>>> - if (len & 1)
>>> - result += *buff;
>>> - result = add32_with_carry(result>>32, result & 0xffffffff);
>>> - if (unlikely(odd)) {
>>> - result = from32to16(result);
>>> - result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>>> +/* Extract eight low order (nbo) bytes of quad in "val". */
>>> +static inline unsigned long csum_partial_lt8_tail(unsigned long val, int len)
>>> +{
>>> + return val >> (8 * (8 - len));
>>> +}
>>
>> From what I can tell these functions only get used in one or two
>> spots. It may be worth just placing the code directly at the call
>> sites rather than obscuring it behind helper functions.
>>
>>> +/* Sum over buffer up to 64 bytes. */
>>> +static unsigned long csum_partial_le64(const void *buff, unsigned int len,
>>> + unsigned long sum)
>>> +{
>>> + asm("lea 40f(, %[slen], 4), %%r11\n\t"
>>> + "clc\n\t"
>>> + "jmpq *%%r11\n\t"
>>> + "adcq 7*8(%[src]),%[res]\n\t"
>>
>> Technically this first adcq could just be an addq. You don't have
>> anything to carry anyway.
>>
> To what end?
It gets us back to what I was proposing at netconf: merge the main
loop and your jump table path into a Duff's device, so you can save
yourself some code overhead.
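Roughly this kind of structure is what I mean, written in plain C just
to illustrate (csum_quads and the loop shape are mine; it reuses the
add64_with_carry() helper from your patch in place of the adcq chain,
and the real thing would keep the carry chain in asm):

static unsigned long csum_quads(const void *buff, unsigned int len,
				unsigned long sum)
{
	const unsigned long *p = buff;
	unsigned int quads = len >> 3;		/* number of 8-byte words */
	unsigned int n = (quads + 7) >> 3;	/* loop iterations */

	if (!quads)
		return sum;

	/* Duff's device: jump into the middle of the unrolled loop to
	 * handle quads % 8, then fall into full 8-quad iterations.
	 */
	switch (quads & 7) {
	case 0: do {	sum = add64_with_carry(sum, *p++);
	case 7:		sum = add64_with_carry(sum, *p++);
	case 6:		sum = add64_with_carry(sum, *p++);
	case 5:		sum = add64_with_carry(sum, *p++);
	case 4:		sum = add64_with_carry(sum, *p++);
	case 3:		sum = add64_with_carry(sum, *p++);
	case 2:		sum = add64_with_carry(sum, *p++);
	case 1:		sum = add64_with_carry(sum, *p++);
		} while (--n);
	}

	return sum;
}

The tail (len & 7) would still be handled the way your patch already
does it; the point is just that one body serves both the main loop and
what is currently the jump table.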
>>> + "adcq 6*8(%[src]),%[res]\n\t"
>>> + "adcq 5*8(%[src]),%[res]\n\t"
>>> + "adcq 4*8(%[src]),%[res]\n\t"
>>> + "adcq 3*8(%[src]),%[res]\n\t"
>>> + "adcq 2*8(%[src]),%[res]\n\t"
>>> + "adcq 1*8(%[src]),%[res]\n\t"
>>> + "adcq 0*8(%[src]),%[res]\n\t"
>>> + "nop\n\t"
>>> + "40: adcq $0,%[res]"
>>> + : [res] "=r" (sum)
>>> + : [src] "r" (buff),
>>> + [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
>>> + : "r11");
>>> +
>>
>> It would probably be useful to add a comment explaining this block.
>> I had to compile it to verify my understanding was correct, namely
>> that you are using the label 40 as a jump target and simply
>> subtracting the length in longs from that jump target. If nothing
>> else you might want to rename the label from 40 to something else
>> just to improve readability. It is possible that the code could get
>> shifted around or inlined, and at that point the 40 becomes
>> completely meaningless.
>>
>> Also it wouldn't hurt to explain the nop, and that the 4 in the lea
>> instruction is there because each adcq with a memory offset encodes
>> to 4 bytes; the last one has an offset of 0, so the nop is needed to
>> keep everything 4 byte aligned from the end.
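To spell out the arithmetic (illustrative comment only; the offsets and
label come from the asm above):

/*
 * lea 40f(, %[slen], 4), %%r11    with    slen = -(len >> 3)
 * computes
 *
 *	target = <label 40> - (len >> 3) * 4
 *
 * Each "adcq N*8(%[src]),%[res]" encodes to 4 bytes, except the final
 * "adcq 0*8(...)", which is a byte shorter because it needs no
 * displacement; the nop pads it back out so that stepping backwards
 * from label 40 in 4-byte strides always lands on an instruction
 * boundary.  For example, len = 24 gives len >> 3 = 3, so the jmp
 * lands 12 bytes before label 40, i.e. on "adcq 2*8(...)", and exactly
 * three quads (offsets 16, 8, 0) are summed before falling into the
 * final "adcq $0" carry fold.
 */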
>>
>>> + if (len & 0x7) {
>>> + unsigned long val;
>>> + /*
>>> + * Since "len" is > 8 here we backtrack in the buffer to load
>>> + * the outstanding bytes into the low order bytes of a quad and
>>> + * sum those in csum_partial_lt8_tail. By doing this we avoid
>>> + * additional calls to load_unaligned_zeropad.
>>> + */
>>> +
>>> + val = csum_partial_lt8_tail(*(unsigned long *)(buff + len - 8),
>>> + len & 0x7);
>>> + sum = add64_with_carry(val, sum);
>>> }
>>> - return result;
>>> +
>>> + return sum;
>>> }
>>>
>>> /*
>>> @@ -130,10 +82,77 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>>> *
>>> * it's best to have buff aligned on a 64-bit boundary
>>> */
>>> +static unsigned int do_csum(const void *buff, int len, unsigned int sum)
>>> +{
>>> + bool aligned = false;
>>> + unsigned long result = 0;
>>> +
>>> + /*
>>> + * For "small" lengths don't bother with alignment, x86 handles
>>> + * this pretty well.
>>> + */
>>> + if (len <= 8) {
>>> + unsigned long val;
>>> +
>>> + /* Optimize trivial cases */
>>> + if (len == 8)
>>> + return add32_with_carry3(sum,
>>> + *(unsigned int *)buff,
>>> + *(unsigned int *)(buff + 4));
>>
>> Doesn't this still run the risk of triggering a fault if the buffer
>> crosses pages on a non-4 byte aligned boundary?
>>
> It's safe because we know the length of the buffer is eight bytes.
> There is only one instance where we need load_unaligned_zeropad to
> deal with potentially crossing over to a bad page.
I think I see what you are getting at, but to me it just seems like
there are some unnecessary optimizations going on. For example, I don't
really see the point in optimizing for the 0 case below. If someone
requested a sum over 0 bytes, I believe the lt8_head function will
return 0 anyway.
You might also be better off just converting this to a (len < 7) test
and letting the code fall through into the section for less than 64,
as in the sketch below.
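Something like this is what I am picturing for the top of do_csum (a
sketch only; I used len < 8 as the cut-off so the tail load in
csum_partial_le64, which reads buff + len - 8, can never reach back
before the start of the buffer):

	if (len < 8) {
		unsigned long val = load_unaligned_zeropad(buff);

		/* A length of 0 masks everything off, so it needs no
		 * special case of its own.
		 */
		result = csum_partial_lt8_head(val, len);
		return add32_with_carry3(sum, result >> 32,
					 result & 0xffffffff);
	}
	if (len <= 64) {
		/* len == 8 now takes this path as well */
		result = csum_partial_le64(buff, len, result);
		return add32_with_carry3(sum, result >> 32,
					 result & 0xffffffff);
	}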
>>> + else if (len == 0)
>>> + return sum;
>>> +
>>> + val = load_unaligned_zeropad(buff);
>>> + result = csum_partial_lt8_head(val, len);
>>> +
>>> + return add32_with_carry3(sum, result >> 32,
>>> + result & 0xffffffff);
>>> + } else if (len <= 64) {
>>> + result = csum_partial_le64(buff, len, result);
>>> +
>>> + return add32_with_carry3(sum, result >> 32,
>>> + result & 0xffffffff);
>>> + }
>>> +
>>
>> One side effect of calling csum_partial_le64 in two spots is that
>> you end up making an actual call to it instead of having it inlined
>> directly into your do_csum function. What you may want to do is alter
>> your code flow so that when (len > 64) you do the bits that currently
>> occur before the csum_partial_le64 call, then fall through; that way
>> csum_partial_le64 ends up with a single caller, which should get the
>> compiler to inline it into your function.
>>
> I see that gcc is inlining both calls to csum_partial_le64.
Maybe your compiler is smarter than mine, or I have some other build
option enabled that is preventing it from being inlined.
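For reference, this is roughly the flow I am thinking of (a sketch
only; the helpers are the ones from your patch, and the 64-byte asm
block is stood in for by add64_with_carry() calls to keep the sketch
short -- the sum stays equivalent under the ones' complement fold):

static unsigned int do_csum(const void *buff, int len, unsigned int sum)
{
	bool aligned = false;
	unsigned long result = 0;

	if (len < 8) {
		/* small-length path, as sketched earlier */
		result = csum_partial_lt8_head(load_unaligned_zeropad(buff),
					       len);
		return add32_with_carry3(sum, result >> 32,
					 result & 0xffffffff);
	}

	if (len > 64) {
		aligned = (unsigned long)buff & 0x1;
		if (aligned) {
			unsigned int align = 7 & -(unsigned long)buff;

			result = csum_partial_lt8_head(*(unsigned long *)buff,
						       align);
			buff += align;
			len -= align;
			result = rotate_by8_if_odd(result, align);
		}

		/* stand-in for the 8 x adcq asm loop */
		for (; len >= 64; len -= 64, buff += 64) {
			int i;

			for (i = 0; i < 8; i++)
				result = add64_with_carry(result,
					((const unsigned long *)buff)[i]);
		}
	}

	/* Single call site: the compiler is free to inline this. */
	result = csum_partial_le64(buff, len, result);
	result = rotate_by8_if_odd(result, aligned);

	return add32_with_carry3(sum, result >> 32, result & 0xffffffff);
}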
>>> + /*
>>> + * Length is greater than 64. Sum to eight byte alignment before
>>> + * proceeding with main loop.
>>> + */
>>> + aligned = !!((unsigned long)buff & 0x1);
>>> + if (aligned) {
>>> + unsigned int align = 7 & -(unsigned long)buff;
>>> +
>>> + result = csum_partial_lt8_head(*(unsigned long *)buff, align);
>>> + buff += align;
>>> + len -= align;
>>> + result = rotate_by8_if_odd(result, align);
>>> + }
>>
>> I'm still not a fan of the unaligned reads. They may be okay, but it
>> just seems like we are going to run into corner cases all over the
>> place where this ends up biting us.
>>
> They are already all over the place in the stack. For instance, any
> time we process an IP header within VXLAN we're likely doing several
> unaligned reads to load IP addresses etc. These are a big problem for
> architectures like Sparc, but do not appear to be an issue for x86.
Yes, but in our case we can align ourselves with minimal overhead,
which is why I figure it might be worth the effort to do so.
>>> + /* main loop using 64byte blocks */
>>> + for (; len >= 64; len -= 64, buff += 64) {
>>> + asm("addq 0*8(%[src]),%[res]\n\t"
>>> + "adcq 1*8(%[src]),%[res]\n\t"
>>> + "adcq 2*8(%[src]),%[res]\n\t"
>>> + "adcq 3*8(%[src]),%[res]\n\t"
>>> + "adcq 4*8(%[src]),%[res]\n\t"
>>> + "adcq 5*8(%[src]),%[res]\n\t"
>>> + "adcq 6*8(%[src]),%[res]\n\t"
>>> + "adcq 7*8(%[src]),%[res]\n\t"
>>> + "adcq $0,%[res]"
>>> + : [res] "=r" (result)
>>> + : [src] "r" (buff),
>>> + "[res]" (result));
>>> + }
>>> +
>>> + result = csum_partial_le64(buff, len, result);
>>> + result = rotate_by8_if_odd(result, aligned);
>>> +
>>> + return add32_with_carry3(sum, result >> 32, result & 0xffffffff);
>>> +}
>>> +
I'll play around with this code a bit over the weekend; I have a few
ideas I want to try.
I figure we should be able to get this code running pretty lean. For
example, I was looking at the Duff's device idea, and I think we can
probably make this code quite a bit smaller and faster.
As far as the patch itself goes, I have no objections. It is better
than what we have now. I think I am just starting to realize we might
be able to squeeze some more speed out of it, and might even be able
to reduce the code size in the process.
- Alex