netdev - Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming


Message-ID: <91fafe40-7856-8b22-c279-55df5d06ca39@gmail.com>
Date:   Thu, 5 Mar 2020 09:19:43 -0800
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     David Laight <David.Laight@...LAB.COM>,
        'Willem de Bruijn' <willemdebruijn.kernel@...il.com>,
        Yadu Kishore <kyk.segfault@...il.com>
Cc:     David Miller <davem@...emloft.net>,
        Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH v2] net: Make skb_segment not to compute checksum if
 network controller supports checksumming



On 3/5/20 9:00 AM, David Laight wrote:
> From: Willem de Bruijn
>> Sent: 05 March 2020 16:07
> ..
>> It seems do_csum is called because csum_partial_copy executes the
>> two operations independently:
>>
>> __wsum
>> csum_partial_copy(const void *src, void *dst, int len, __wsum sum)
>> {
>>         memcpy(dst, src, len);
>>         return csum_partial(dst, len, sum);
>> }
> 
> And do_csum() is superbly horrid.
> Not least because it is 32-bit on 64-bit systems.

There are many versions; which one is being discussed here?

At least the current one seems to be optimized for 64-bit.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5777eaed566a1d63e344d3dd8f2b5e33be20643e


> 
> A better inner loop (even on 32bit) would be:
> 	u64 result = 0; // better old 'sum' the caller wants to add in.
> 	...
> 	do {
> 		result += *(u32 *)buff;
> 		buff += 4;
> 	} while (buff < end);
> 
> (That is as fast as the x86 'adc' loop on Intel x86 CPUs
> prior to Haswell!)
> Tweaking the above might cause gcc to generate code that
> executes one iteration per clock on some superscalar CPUs.
> Adding alternate words to different registers may be
> beneficial on CPUs that can do two memory reads in 1 clock.
> 
> Then reduce from 64bits down to 16. Maybe with:
> 	if (odd)
> 		result <<= 8;
> 	result = (u32)result + (result >> 32);
> 	result32 = (u16)result + (result >> 16);
> 	result32 = (u16)result32 + (result32 >> 16);
> 	result32 = (u16)result32 + (result32 >> 16);
> I think you need 4 reduces.
> Modulo 0xffff might generate faster code, but the result domain
> is then 0..0xfffe not 1..0xffff.
> Which is actually better because it is correct when inverted.
> 
> If reducing with adds, it is best to initialise the csum to 1.
> Then add 1 after inverting.
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
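
For reference, the inner loop and the four-step reduction quoted above can
be sketched in userspace C roughly as follows. This is only an illustration
under stated assumptions, not the kernel's do_csum(): the helper names
(sum32_into_64, fold64_to_16, fold64_mod) are hypothetical, the length is
assumed to be a multiple of 4, the odd-alignment shift is left out, and a
plain while loop replaces the do/while so that an empty buffer is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add 32-bit words into a 64-bit accumulator.
 * Assumes len is a multiple of 4. 'result' is the running sum the
 * caller wants to add in. */
static uint64_t sum32_into_64(const unsigned char *buff, size_t len,
			      uint64_t result)
{
	const unsigned char *end = buff + len;

	assert(len % 4 == 0);
	while (buff < end) {
		uint32_t word;

		memcpy(&word, buff, sizeof(word)); /* alignment-safe load */
		result += word;
		buff += 4;
	}
	return result;
}

/* Reduce 64 bits to 16 with four end-around-carry adds.
 * A non-zero input never folds to 0, so the result is in 1..0xffff
 * (0 only when the input itself is 0). */
static uint16_t fold64_to_16(uint64_t result)
{
	uint32_t result32;

	result = (uint32_t)result + (result >> 32);             /* <= 33 bits */
	result32 = (uint16_t)result + (uint32_t)(result >> 16); /* <= 18 bits */
	result32 = (uint16_t)result32 + (result32 >> 16);       /* <= 0x10000 */
	result32 = (uint16_t)result32 + (result32 >> 16);       /* <= 0xffff */
	return (uint16_t)result32;
}

/* Alternative reduction: modulo 0xffff, result is in 0..0xfffe. */
static uint16_t fold64_mod(uint64_t result)
{
	return (uint16_t)(result % 0xffff);
}
```

The bounds in the comments show why four adds are needed: the third add can
still leave 0x10000, which only the fourth one folds back to 1. The two
reductions agree except where the sum is a non-zero multiple of 0xffff,
where the add-based fold gives 0xffff and the modulo gives 0.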