Date: Sat, 17 Feb 2024 22:47:31 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Charlie Jenkins' <charlie@...osinc.com>, Helge Deller <deller@...nel.org>
CC: Guenter Roeck <linux@...ck-us.net>, Helge Deller <deller@....de>, "James E
 . J . Bottomley" <James.Bottomley@...senpartnership.com>,
	"linux-parisc@...r.kernel.org" <linux-parisc@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Palmer Dabbelt
	<palmer@...osinc.com>
Subject: RE: [PATCH] parisc: Fix csum_ipv6_magic on 64-bit systems

..
> We can do better than this! By inspection this looks like a performance
> regression. The generic version of csum_fold in
> include/asm-generic/checksum.h is better than this so should be used
> instead.

Yes, that got changed for 6.8-rc1 (I pretty much suggested the patch)
but I hadn't noticed Linus had applied it.
That C version is (probably) not worse than any of the asm versions
except sparc32 - which has a carry flag but no rotate.
(It is better than the x86-64 asm one.)
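
For reference, a userspace sketch of the two folds. The rotate-based one
is written from memory of include/asm-generic/checksum.h, so treat its
exact form as an assumption; plain integer types stand in for the kernel's
__wsum/__sum16, and the name csum_fold_classic is made up here.

```c
#include <stdint.h>

/* Rotate right by n (0 < n < 32); normally compiles to one rotate. */
static inline uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* Textbook fold: add the halves, add the carry back in, complement. */
static inline uint16_t csum_fold_classic(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Rotate-based fold. The borrow out of the low half of the subtraction
 * plays the role of the end-around carry, so the top 16 bits of the
 * difference are already the complemented, folded sum.
 */
static inline uint16_t csum_fold_ror(uint32_t sum)
{
	return (uint16_t)((~sum - ror32(sum, 16)) >> 16);
}
```

The second form is a not, a rotate, a subtract and a shift, with no
dependency on a carry flag.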

..
> This doesn't leverage add with carry well. This causes the code size of this
> to be dramatically larger than the original assembly, which I assume
> nicely correlates to an increased execution time.

It is pretty much impossible to do add with carry from C.
So an asm adc block is pretty much always going to win.
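
To illustrate, the usual way to spell it in portable C is a compare after
the add, roughly as below (add_carry64 is a hypothetical helper name, not
a kernel function). The compare and the second add are the extra work that
a single adc avoids.

```c
#include <stdint.h>

/* 64-bit add with end-around carry, without access to a carry flag. */
static inline uint64_t add_carry64(uint64_t sum, uint64_t v)
{
	sum += v;
	/* sum < v exactly when the add wrapped; adding 1 cannot wrap again. */
	return sum + (sum < v);
}
```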

For csum_partial and short to moderate length buffers
on x86 it is hard to beat 10: adc, adc, dec, jnz 10b
which (on modern Intel CPUs at least) does 8 bytes/clock.
You can get 12 bytes/clock but it only really wins for 256+ bytes.
(See the current x86-64 version.)

For CPUs without a carry flag it is likely that a common C
function will be pretty much optimal on all architectures.
(Or maybe a couple of implementations based on the actual
CPU implementation - not the architecture.)

Mostly I don't think you can beat 4 instructions/word, but they
will pipeline so with multi-issue you might get a read/clock.
Arm's barrel shifter might give 3: v = *p; x += v, y += v >> 32.
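
A rough C sketch of both loop shapes, for 64-bit words. Function names are
made up for illustration, and the (uint32_t) truncation in the split
version is my reading of the "x += v" shorthand (on arm64 it can fold into
the add as an extended-register operand), so take that as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare-based: load, add, compare, add for each 64-bit word. */
static uint64_t sum_words_cmp(const uint64_t *p, size_t n)
{
	uint64_t sum = 0;

	while (n--) {
		uint64_t v = *p++;

		sum += v;
		sum += sum < v;		/* end-around carry */
	}
	return sum;
}

/* Split: accumulate 32-bit halves separately, so no carry can be lost. */
static uint64_t sum_words_split(const uint64_t *p, size_t n)
{
	uint64_t x = 0, y = 0;

	while (n--) {
		uint64_t v = *p++;

		x += (uint32_t)v;	/* low half */
		y += v >> 32;		/* high half */
	}
	return x + y;			/* safe for n below 2^31 words */
}

/* Fold a 64-bit partial sum down to 16 bits (not complemented). */
static uint16_t fold64(uint64_t s)
{
	s = (s & 0xffffffff) + (s >> 32);
	s = (s & 0xffffffff) + (s >> 32);
	s = (s & 0xffff) + (s >> 16);
	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}
```

The two loops return different 64-bit values but fold to the same 16-bit
sum, which is all the checksum needs.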

	David
