Message-ID: <817a6418ac8742e6bb872992711beb47@AcuMS.aculab.com>
Date:   Thu, 5 Mar 2020 17:00:09 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Willem de Bruijn' <willemdebruijn.kernel@...il.com>,
        Yadu Kishore <kyk.segfault@...il.com>
CC:     David Miller <davem@...emloft.net>,
        Network Development <netdev@...r.kernel.org>
Subject: RE: [PATCH v2] net: Make skb_segment not to compute checksum if
 network controller supports checksumming

From: Willem de Bruijn
> Sent: 05 March 2020 16:07
..
> It seems do_csum is called because csum_partial_copy executes the
> two operations independently:
> 
> __wsum
> csum_partial_copy(const void *src, void *dst, int len, __wsum sum)
> {
>         memcpy(dst, src, len);
>         return csum_partial(dst, len, sum);
> }

And do_csum() is superbly horrid,
not least because it uses 32-bit arithmetic even on 64-bit systems.

A better inner loop (even on 32bit) would be:
	u64 result = 0; // or the old 'sum' the caller wants to add in
	...
	do {
		result += *(u32 *)buff;
		buff += 4;
	} while (buff < end);

(That is as fast as the x86 'adc' loop on Intel x86 CPUs
prior to Haswell!)
Tweaking the above might get gcc to generate code that
executes one iteration per clock on some superscalar CPUs.
Accumulating alternate words into different registers may be
beneficial on CPUs that can do two memory reads in one clock.
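Something like this (an untested sketch, reusing do_csum()'s
'buff' and 'len' and the 'result' above, and assuming len is a
multiple of 8 and the buffer is 4-byte aligned):

	u64 sum0 = 0, sum1 = 0;
	const u8 *end = buff + len;

	do {
		/* Two independent dependency chains, so a cpu with
		 * two load ports can issue both reads in one clock. */
		sum0 += *(const u32 *)buff;
		sum1 += *(const u32 *)(buff + 4);
		buff += 8;
	} while (buff < end);
	/* Each accumulator grows by less than 2^32 per iteration,
	 * so the u64s cannot wrap for any sane buffer length. */
	result = sum0 + sum1;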

Then reduce from 64 bits down to 16. Maybe with:
	if (odd)
		result <<= 8;
	result = (u32)result + (result >> 32);
	result32 = (u16)result + (result >> 16);
	result32 = (u16)result32 + (result32 >> 16);
	result32 = (u16)result32 + (result32 >> 16);
You do need all 4 reduces to cover the worst case.
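For example, the first reduce can leave result = 0x10000ffff
(e.g. from a 64-bit sum of 0x00010000ffffffff), and then:

	fold 1: 0xffff + 0x10000 = 0x1ffff
	fold 2: 0xffff + 0x1     = 0x10000
	fold 3: 0x0    + 0x1     = 0x1

so the final fold is not redundant.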
Modulo 0xffff might generate faster code, but the result domain
is then 0..0xfffe rather than 1..0xffff.
That is actually better because it is correct when inverted:
the inverse lands in 1..0xffff, so the all-zero checksum
(which UDP uses to mean 'no checksum') can never be emitted.
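For example (untested, with 'result' being the 64-bit sum from
the loop above):

	/* 2^16 == 1 (mod 0xffff), so the 64-bit sum taken modulo
	 * 0xffff is the 16-bit ones' complement sum directly. */
	u32 fold = result % 0xffff;	/* 0..0xfffe */
	u16 csum = (u16)~fold;		/* 1..0xffff, never 0 */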

If reducing with adds, it is best to initialise the csum to 1
and then add 1 back after inverting.
Otherwise a sum that add-reduces to 0xffff would invert to an
(invalid) all-zero checksum.
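Roughly (untested, with result32 being the add-reduced value
from above):

	u64 result = 1;		/* bias the sum by 1 up front */
	/* ... accumulate and do the 4 add-reduces as above ... */
	u16 csum = (u16)~result32 + 1;	/* undo the bias; 0 can't occur */

That gives the same 1..0xffff output domain as the modulo version.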

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
