Date:   Wed, 9 Mar 2022 09:53:04 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Christophe Leroy' <christophe.leroy@...roup.eu>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Paul Mackerras <paulus@...ba.org>,
        "Michael Ellerman" <mpe@...erman.id.au>
CC:     "linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] powerpc: Use rol32() instead of opencoding in csum_fold()

From: Christophe Leroy
> Sent: 09 March 2022 07:56
...
> diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
> index 8321f6053a67..4b573a3b7e17 100644
> --- a/arch/powerpc/include/asm/checksum.h
> +++ b/arch/powerpc/include/asm/checksum.h
> @@ -38,14 +38,15 @@ extern __wsum csum_and_copy_to_user(const void *src, void __user *dst,
>   */
>  static inline __sum16 csum_fold(__wsum sum)
>  {
> +	u32 tmp = (__force u32)sum;
> +
> +	/*
> +	 * swap the two 16-bit halves of sum
> +	 * if there is a carry from adding the two 16-bit halves,
> +	 * it will carry from the lower half into the upper half,
> +	 * giving us the correct sum in the upper half.
> +	 */
> +	return (__force __sum16)(~(tmp + rol32(tmp, 16)) >> 16);
>  }
> 
>  static inline u32 from64to32(u64 x)
> --
> 2.34.1

On the face of it, that is pretty generic: two shifts and an add
(plus the invert and mask).

I suspect it generates better code than the current x86 version,
which is:

static inline __sum16 csum_fold(__wsum sum)
{
	asm("addl %1, %0		;\n"
	    "adcl $0xffff, %0	;\n"
	    : "=r" (sum)
	    : "r" ((__force u32)sum << 16),
	      "0" ((__force u32)sum & 0xffff0000));
	return (__force __sum16)(~(__force u32)sum >> 16);
}

That looks like two shifts, a mask, an add and an adc.
(Oh, and the adc is two clocks on anything prior to Haswell.)
Quite why it doesn't use a 16-bit add and adc is anybody's guess;
it would still be a shift, an add and an adc.
So shift, add, shift is no worse.
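
For illustration, a 16-bit add/adc variant might look something like the
sketch below (purely hypothetical, not code from any tree; the u16
operands make gcc emit the 16-bit register names):

static inline __sum16 csum_fold(__wsum sum)
{
	/* Hypothetical 16-bit add/adc variant - sketch only. */
	u16 lo = (__force u32)sum;

	/* lo += hi; the adcw folds the end-around carry back in. */
	asm("addw %1, %0	;\n"
	    "adcw $0, %0	;\n"
	    : "+r" (lo)
	    : "r" ((u16)((__force u32)sum >> 16)));
	return (__force __sum16)~lo;
}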

I wonder if any of the other asm versions are actually better.
Some use the same algorithm; some are a lot worse.

Generic is:
static inline __sum16 csum_fold(__wsum csum)
{
	u32 sum = (__force u32)csum;
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (__force __sum16)~sum;
}
It can clearly be improved.
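
As a quick sanity check (a userspace sketch using plain stdint types
rather than the kernel's __wsum/__sum16), the rol32-style fold and the
two-step generic fold can be compared directly:

#include <stdint.h>
#include <stdio.h>

/* rol32-style fold: adding a value to its halves-swapped self leaves
 * the ones-complement fold in the top 16 bits.
 */
static uint16_t fold_rol(uint32_t s)
{
	uint32_t r = (s << 16) | (s >> 16);	/* rol32(s, 16) */

	return (uint16_t)(~(s + r) >> 16);
}

/* Current generic fold: two add-and-propagate steps, then invert. */
static uint16_t fold_generic(uint32_t s)
{
	s = (s & 0xffff) + (s >> 16);
	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)~s;
}

int main(void)
{
	/* Spot-check values around the carry edge cases. */
	const uint32_t tests[] = {
		0, 1, 0xffff, 0x10000, 0xffff0001, 0xfffffffe, 0xffffffff
	};
	unsigned int i;

	for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++)
		if (fold_rol(tests[i]) != fold_generic(tests[i]))
			printf("mismatch at %#x\n", (unsigned int)tests[i]);

	return 0;
}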

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
