lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6e7bc85a3f92419f89117fc1381511be@AcuMS.aculab.com>
Date:   Wed, 17 Mar 2021 16:09:17 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     "'Maciej W. Rozycki'" <macro@...am.me.uk>,
        Tiezhu Yang <yangtiezhu@...ngson.cn>
CC:     Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
        "linux-mips@...r.kernel.org" <linux-mips@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Xuefeng Li <lixuefeng@...ngson.cn>
Subject: RE: [PATCH v2] MIPS: Check __clang__ to avoid performance influence
 with GCC in csum_tcpudp_nofold()

From: Maciej W. Rozycki
> Sent: 17 March 2021 15:36
..
> > > Not that I grok the mips opcodes.
> > > But that code has horridness on its side.
> 
>  It's a 32-bit one's-complement addition.  The use of 64-bit operations
> reduces the number of calculations as any 32-bit carries accumulate in the
> high 32-bit word allowing one instruction to be saved total compared to
> the 32-bit variant.  Nothing particularly unusual for me here; I've seen
> worse stuff with x86.

The 'problem' is that mips doesn't have a carry flag.
So the 64-bit maths is 'tricky'.
It may well be that a loop based on:
	do {
		val = *ptr++;
		sum += val;
		carry += sum < val;
	} while (ptr != limit)
will generate much better code.
I think there is a 'setlt' instruction for the compare.
It certainly would on the nios (which is mips-like).
That is (probably) 6 instructions for 4 bytes.
I suspect there may be a data stall after the memory read.
So an interleaved unroll would remove that stall.
That would be 10 clocks for 8 bytes.

The x86-64 code is 'interesting'.
It has repeated 'add carry' instructions.
On Intel cpus prior to (at least) Haswell they take two clocks each.
So the code is no faster than adding 32bit values to a 64bit sum.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ