Message-ID: <25c62b93-672f-296f-ad8f-e8dd144a3a20@broadcom.com>
Date: Wed, 11 May 2016 14:36:46 -0400
From: Luke Starrett <luke.starrett@...adcom.com>
To: Robin Murphy <robin.murphy@....com>,
Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will.deacon@....com>
Cc: BCM Kernel Feedback <bcm-kernel-feedback-list@...adcom.com>,
linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH 1/1] arm64: Add ARM64 optimized IP checksum routine
On 5/11/2016 8:55 AM, Robin Murphy wrote:
> Hi Luke,
>
> On 10/05/16 19:27, Luke Starrett wrote:
>> This change implements an optimized checksum for arm64, based loosely on
>> the original arch/arm implementation. Load-pair is used for the initial
>> 16B load, reducing the overall number of loads to two for packets without
>> IP options. Instruction count is reduced by ~3x compared to the generic C
>> implementation. Changes were tested against LE and BE operation and for
>> all possible misalignments.
>>
>> Signed-off-by: Luke Starrett <luke.starrett@...adcom.com>
>> ---
>> arch/arm64/include/asm/checksum.h | 57 +++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 57 insertions(+)
>> create mode 100644 arch/arm64/include/asm/checksum.h
>>
>> diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h
>> new file mode 100644
>> index 0000000..15629d7
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/checksum.h
>> @@ -0,0 +1,57 @@
>> +/*
>> + * Copyright 2016 Broadcom
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License, version 2, as
>> + * published by the Free Software Foundation (the "GPL").
>> + *
>> + * This program is distributed in the hope that it will be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>> + * General Public License version 2 (GPLv2) for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * version 2 (GPLv2) along with this source code.
>> + *
>> + * ip_fast_csum() loosely derived from original arch/arm implementation
>> + */
>> +
>> +#ifndef __ASM_ARM64_CHECKSUM_H
>> +#define __ASM_ARM64_CHECKSUM_H
>> +
>> +/*
>> + * This is a version of ip_compute_csum() optimized for IP headers,
>> + * which always checksum on 4 octet boundaries.
>> + */
>> +static inline __sum16
>> +ip_fast_csum(const void *iph, unsigned int ihl)
>> +{
>> + u64 tmp, sum;
>> +
>> + __asm__ __volatile__ (
>> +" ldp %0, %3, [%1], #16\n"
>> +" sub %2, %2, #4\n"
>> +" adds %0, %0, %3\n"
>> +"1: ldr %w3, [%1], #4\n"
>> +" sub %2, %2, #1\n"
>> +" adcs %0, %0, %3\n"
>
> This looks suspicious - the following tst is going to zero the carry
> flag, so either setting the flags is unnecessary, or things are
> working by chance.
>
Yes, you are correct. Thinking about this a bit, it is working by
chance, as you suspected: the second adcs adds a 32-bit quantity to the
prior 64-bit accumulator, so it is unlikely (though not impossible) to
produce a carry before the tst clobbers the flags. I'll reorder this.
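Something like the following should keep the carry chain intact (an
untested sketch: cbz/cbnz don't touch the flags, so the carry out of
each adcs survives to the next iteration and to the final adc; the fold
sequence at the end stays the same):

"	ldp	%0, %3, [%1], #16\n"
"	sub	%2, %2, #4\n"
"	adds	%0, %0, %3\n"
"	cbz	%2, 2f\n"
"1:	ldr	%w3, [%1], #4\n"
"	adcs	%0, %0, %3\n"
"	sub	%2, %2, #1\n"
"	cbnz	%2, 1b\n"
"2:	adc	%0, %0, xzr\n"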
>> +" tst %2, #15\n"
>> +" bne 1b\n"
>> +" adc %0, %0, xzr\n"
>
> Similarly, you should never get here with the carry flag set, so this
> is just a pointless nop.
>
>> +" ror %3, %0, #32\n"
>> +" add %0, %0, %3\n"
>> +" lsr %0, %0, #32\n"
>> +" ror %w3, %w0, #16\n"
>> +" add %w0, %w0, %w3\n"
>> +" lsr %0, %0, #16\n"
>> + : "=r" (sum), "=r" (iph), "=r" (ihl), "=r" (tmp)
>> + : "1" (iph), "2" (ihl)
>> + : "cc", "memory");
>> +
>> + return (__force __sum16)(~sum);
>> +}
>
> Given that dodginess, and that there doesn't seem to be anything here
> which absolutely _demands_ assembly voodoo, can we not just have an
> optimised C implementation working on larger chunks, along the lines
> of Alpha?
>
I would agree, though I couldn't see any obvious way to get the
compiler to combine the individual loads into an 'ldp' without some
level of inlining, and at that point I just decided to do the whole
thing inline. It was clear, though, that reducing to a single ldp + ldr
in the common no-options case was beneficial.
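For reference, a C version along the lines of Alpha's could look
roughly like this (a rough, untested sketch; it assumes the usual
4-byte alignment of IP headers, and the compiler may or may not merge
the loads into an ldp):

static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
{
	const u32 *p = iph;	/* IP headers are 4-byte aligned */
	u64 sum = 0;
	unsigned int i;

	/* ihl <= 15, so fifteen 32-bit words cannot overflow a u64 */
	for (i = 0; i < ihl; i++)
		sum += p[i];

	/* fold 64 -> 32 -> 16, propagating the end-around carries */
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);

	return (__force __sum16)~sum;
}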
Luke
> Robin.
>
>> +
>> +#define ip_fast_csum ip_fast_csum
>> +#include <asm-generic/checksum.h>
>> +
>> +#endif /* __ASM_ARM64_CHECKSUM_H */
>>
>