Message-ID: <6e7bc85a3f92419f89117fc1381511be@AcuMS.aculab.com>
Date:   Wed, 17 Mar 2021 16:09:17 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     "'Maciej W. Rozycki'" <macro@...am.me.uk>,
        Tiezhu Yang <yangtiezhu@...ngson.cn>
CC:     Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
        "linux-mips@...r.kernel.org" <linux-mips@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Xuefeng Li <lixuefeng@...ngson.cn>
Subject: RE: [PATCH v2] MIPS: Check __clang__ to avoid performance influence
 with GCC in csum_tcpudp_nofold()

From: Maciej W. Rozycki
> Sent: 17 March 2021 15:36
..
> > > Not that I grok the mips opcodes.
> > > But that code has horridness on its side.
> 
>  It's a 32-bit one's-complement addition.  The use of 64-bit operations
> reduces the number of calculations as any 32-bit carries accumulate in the
> high 32-bit word allowing one instruction to be saved total compared to
> the 32-bit variant.  Nothing particularly unusual for me here; I've seen
> worse stuff with x86.

The 'problem' is that mips doesn't have a carry flag.
So the 64-bit maths is 'tricky'.
It may well be that a loop based on:
	do {
		val = *ptr++;
		sum += val;
		carry += sum < val;
	} while (ptr != limit);
will generate much better code.
I think there is a 'set on less than' instruction (sltu on mips) for the compare.
It certainly would on the nios (which is mips-like).
That is (probably) 6 instructions for 4 bytes.
I suspect there may be a data stall after the memory read.
So an interleaved unroll would remove that stall.
That would be 10 clocks for 8 bytes.
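
A minimal sketch of the interleaved unroll described above, in plain C.
The function name csum_words(), the two independent accumulator pairs and
the final fold are illustrative assumptions, not code from the patch under
discussion, and no instruction or clock counts are claimed for it. The two
loads are issued before either value is used, so the second load can
overlap any stall on the first.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: one's-complement sum of nwords 32-bit words.
 * Carries are detected with an unsigned compare (sum < val) rather
 * than a carry flag, and two independent sum/carry pairs are
 * interleaved so consecutive loads do not depend on each other.
 */
static uint32_t csum_words(const uint32_t *ptr, size_t nwords)
{
	const uint32_t *limit = ptr + (nwords & ~(size_t)1);
	uint32_t sum0 = 0, sum1 = 0;
	uint32_t carry0 = 0, carry1 = 0;
	uint64_t total;

	while (ptr != limit) {
		uint32_t val0 = ptr[0];
		uint32_t val1 = ptr[1];

		ptr += 2;
		sum0 += val0;
		sum1 += val1;
		carry0 += sum0 < val0;	/* 1 if the add wrapped */
		carry1 += sum1 < val1;
	}

	if (nwords & 1) {		/* odd trailing word */
		uint32_t val = *ptr;

		sum0 += val;
		carry0 += sum0 < val;
	}

	/* Combine both halves and fold the carries back in (end-around). */
	total = (uint64_t)sum0 + sum1 + carry0 + carry1;
	total = (total & 0xffffffffu) + (total >> 32);
	total = (total & 0xffffffffu) + (total >> 32);
	return (uint32_t)total;
}
```

The second fold is needed because the first one can itself produce a carry
out of bit 31.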

The x86-64 code is 'interesting'.
It has repeated 'add carry' instructions.
On Intel CPUs prior to (at least) Haswell they take two clocks each.
So the code is no faster than adding 32-bit values to a 64-bit sum.
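
For comparison, a sketch of the "32-bit values into a 64-bit sum" approach,
which needs neither a carry flag nor an add-with-carry instruction. The
name csum_words64() is hypothetical, and the code assumes fewer than 2^32
input words so that the 64-bit accumulator cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: accumulate 32-bit words into a 64-bit sum.
 * Carries out of bit 31 collect in the high word and are folded
 * back into the low word once at the end.
 */
static uint32_t csum_words64(const uint32_t *ptr, size_t nwords)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++)
		sum += ptr[i];		/* zero-extended 32-bit add */

	/* Fold twice: the first fold may carry out of bit 31 again. */
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	return (uint32_t)sum;
}
```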

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)