Date:	Thu, 4 Feb 2016 11:31:52 -0800
From:	Tom Herbert <tom@...bertland.com>
To:	Alexander Duyck <alexander.duyck@...il.com>
Cc:	David Miller <davem@...emloft.net>,
	Netdev <netdev@...r.kernel.org>, tglx@...utronix.de,
	mingo@...hat.com, hpa@...or.com, x86@...nel.org,
	Kernel Team <kernel-team@...com>
Subject: Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck
<alexander.duyck@...il.com> wrote:
> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@...bertland.com> wrote:
>> Implement an assembly routine for csum_partial for 64-bit x86. This
>> primarily speeds up checksum calculation for smaller lengths, such as
>> those seen in skb_postpull_rcsum when getting CHECKSUM_COMPLETE from
>> a device or after CHECKSUM_UNNECESSARY conversion.
>>
>> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
>> we need to avoid unaligned accesses. Efficient unaligned accesses
>> offer a nice additional speedup as demonstrated in the results
>> provided below.
>>
>> This implementation is similar to the csum_partial implemented in
>> checksum_32.S; however, since we are dealing with 8 bytes at a time
>> there are more cases for small lengths (and alignments), so for those
>> we employ a jump table.
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary-length buffers filled with
>> random data. For each buffer I compared the checksum computed by the
>> new routine against the original algorithm, for each possible
>> alignment (0-7 bytes).
>>
>> Checksum performance:
>>
>> Isolating the old and new implementations for some common cases:
>>
>>          Old      NewA     NewA %   NewNoA   NewNoA %
>> Len/Aln  nsecs    nsecs    Improv   nsecs    Improv
>> --------+-------+--------+-------+-------+---------------------
>> 1400/0    192.9    175.1   10%     174.9     10%   (Big packet)
>> 40/0      13.8     7.7     44%     5.7       58%   (IPv6 hdr cmn case)
>> 8/4       8.4      6.9     18%     2.8       67%   (UDP, VXLAN in IPv4)
>> 14/0      10.5     7.3     30%     5.4       48%   (Eth hdr)
>> 14/4      10.8     8.7     19%     5.4       50%   (Eth hdr in IPv4)
>> 14/3      11.0     9.8     11%     5.6       49%   (Eth with odd align)
>> 7/1       10.0     5.8     42%     4.8       52%   (buffer in one quad)
>>
>> NewA   => new code, CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
>>           (alignment handling needed)
>> NewNoA => new code, CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set
>>           (no alignment handling needed)
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>
>> Also tested on these CPUs with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>
>> Branch prediction:
>>
>> To test the effect of poor branch prediction in the jump tables, I
>> measured checksum performance for runs using two combinations of
>> length and alignment. As the baseline, I performed half of the calls
>> with the first combination, followed by the second combination for
>> the second half. In the test case, I interleaved the two combinations
>> so that the length and alignment differ on every call, defeating
>> branch prediction. Across several such cases I did not see any
>> material performance difference between the baseline and the
>> interleaved test case.
>>
>> Signed-off-by: Tom Herbert <tom@...bertland.com>
>
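
For readers skimming the thread, a minimal C reference model of the
arithmetic the patch description refers to (ones' complement
accumulation over 8-byte words, then folding) may help.  This is only
an illustrative sketch, not the patch itself; the helper names
(add64_carry, fold64_to_32, ref_csum_partial) are invented here, and
it ignores the small-length and alignment special cases that the
assembly version optimizes.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Ones' complement addition on 64 bits: add, then fold the carry back in. */
static uint64_t add64_carry(uint64_t sum, uint64_t val)
{
        sum += val;
        return sum + (sum < val);
}

/* Fold a 64-bit ones' complement accumulator down to 32 bits. */
static uint32_t fold64_to_32(uint64_t sum)
{
        sum = (sum & 0xffffffff) + (sum >> 32);
        sum = (sum & 0xffffffff) + (sum >> 32);
        return (uint32_t)sum;
}

/* Illustrative model of what csum_partial() computes: the ones'
 * complement sum over buff[0..len) added to the initial sum, using
 * native (little-endian on x86) loads. */
static uint32_t ref_csum_partial(const void *buff, size_t len, uint32_t sum)
{
        const unsigned char *p = buff;
        uint64_t acc = sum;
        uint64_t val;

        while (len >= 8) {
                memcpy(&val, p, 8);     /* byte-wise copy: alignment-agnostic */
                acc = add64_carry(acc, val);
                p += 8;
                len -= 8;
        }
        if (len) {                      /* 1..7 trailing bytes, zero padded */
                val = 0;
                memcpy(&val, p, len);
                acc = add64_carry(acc, val);
        }
        return fold64_to_32(acc);
}

A correctness check of the kind described above would compare the
optimized routine against a model like this over random buffers, for
every starting alignment 0-7.
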
> So a couple of general questions about all this.
>
> First, why do you break the alignment handling out over so many
> reads?  It seems like you could just get the offset, subtract it from
> your starting buffer pointer, read the aligned long, and then mask
> off the bits you don't want.  That should take maybe 4 or 5
> instructions, which would save you a bunch of jumping around since
> most of it is just arithmetic.
>
I will look at doing that.
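
To make that concrete, here is a rough C sketch of the
round-down-and-mask idea for the unaligned head, assuming little-endian
x86-64 and an even starting offset (the odd-offset byte swap comes up
further down); the helper name head_quad is made up for illustration:

#include <stdint.h>

/* Return the aligned 8-byte quad containing buff, with the bytes that
 * precede buff masked off.  For an even offset, adding this into the
 * ones' complement accumulator contributes exactly the head bytes.
 * The aligned load stays within the same page as buff, so it cannot
 * fault.  Assumes at least 8 - off bytes of data remain; shorter
 * buffers need their own handling. */
static uint64_t head_quad(const unsigned char *buff)
{
        unsigned long addr = (unsigned long)buff;
        unsigned int off = addr & 7;                    /* bytes into the quad */
        const uint64_t *aligned = (const uint64_t *)(addr - off);
        uint64_t mask = ~0ULL << (off * 8);             /* keep bytes >= buff  */

        return *aligned & mask;
}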

> I am still not sure about the loops, but as mentioned you probably
> don't want to use the loop instruction.  You would likely be better
> off generating a pointer representing the last valid offset and just
> running through the loop that way.
>
> Then on the last read, the same question as the first: why not just
> do one read and mask the result?  It should be faster.
>
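
For illustration, a rough C rendering of that structure: drive the
main loop with an end pointer rather than a count or the loop
instruction, then handle the 1..7 trailing bytes with a single masked
load.  Names here are invented and this is only a sketch, not the
proposed assembly.

#include <stdint.h>
#include <stddef.h>

/* Ones' complement addition on 64 bits (end-around carry). */
static uint64_t add64_carry(uint64_t sum, uint64_t val)
{
        sum += val;
        return sum + (sum < val);
}

/* 'p' is assumed 8-byte aligned here.  The trailing load reads the
 * whole final quad, which is aligned and therefore cannot cross a
 * page boundary, and masks off the bytes beyond the buffer
 * (little-endian: they are the high-order bytes of the load). */
static uint64_t sum_body_and_tail(const uint64_t *p, size_t len, uint64_t acc)
{
        const uint64_t *end = p + len / 8;      /* one past the last full quad */
        size_t tail = len & 7;

        while (p < end) {
                acc = add64_carry(acc, *p);
                p++;
        }
        if (tail) {
                uint64_t mask = ~0ULL >> ((8 - tail) * 8);
                acc = add64_carry(acc, *p & mask);
        }
        return acc;
}
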
> Finally, I am not sure your result is safe if the offset of the
> buffer is an odd value.  The original code did a collapse to 16 bits
> and a rotate by 8 if the offset was odd.  I don't see anything like
> that in your code.

The roll $8 is done in the RETURN macro. We only need to worry about
this when alignment matters.
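
For anyone following along, the effect of that rotate can be modelled
in C roughly as follows: a sum taken from an odd starting address has
every byte in the wrong half of its 16-bit word, and swapping the
bytes of the folded result restores the checksum an even-aligned walk
would have produced.  This is only a sketch of the idea, not the
actual macro:

#include <stdint.h>

/* Fold a 32-bit ones' complement accumulator down to 16 bits. */
static uint16_t fold32_to_16(uint32_t sum)
{
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
}

/* Byte-swap the folded checksum when the data was summed starting
 * from an odd address (the C equivalent of the rotate by 8). */
static uint16_t fixup_odd_start(uint32_t sum, int started_odd)
{
        uint16_t folded = fold32_to_16(sum);

        if (started_odd)
                folded = (uint16_t)((folded >> 8) | (folded << 8));
        return folded;
}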

Thanks,
Tom

>
> - Alex
