linux-kernel - RE: [PATCH v4 2/5] riscv: Add checksum library

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1818c4114b0e4144a9df21f235984840@AcuMS.aculab.com>
Date:   Tue, 12 Sep 2023 08:45:38 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Charlie Jenkins' <charlie@...osinc.com>,
        Palmer Dabbelt <palmer@...belt.com>,
        Conor Dooley <conor@...nel.org>,
        Samuel Holland <samuel.holland@...ive.com>,
        "linux-riscv@...ts.infradead.org" <linux-riscv@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC:     Paul Walmsley <paul.walmsley@...ive.com>,
        Albert Ou <aou@...s.berkeley.edu>
Subject: RE: [PATCH v4 2/5] riscv: Add checksum library

From: Charlie Jenkins
> Sent: 11 September 2023 23:57
> 
> Provide a 32 and 64 bit version of do_csum. When compiled for 32-bit
> will load from the buffer in groups of 32 bits, and when compiled for
> 64-bit will load in groups of 64 bits. Benchmarking by proxy compiling
> csum_ipv6_magic (64-bit version) for an x86 chip as well as running
> the riscv generated code in QEMU, discovered that summing in a
> tree-like structure is about 4% faster than doing 64-bit reads.
> 
...
> +	sum   = saddr->s6_addr32[0];
> +	sum  += saddr->s6_addr32[1];
> +	sum1  = saddr->s6_addr32[2];
> +	sum1 += saddr->s6_addr32[3];
> +
> +	sum2  = daddr->s6_addr32[0];
> +	sum2 += daddr->s6_addr32[1];
> +	sum3  = daddr->s6_addr32[2];
> +	sum3 += daddr->s6_addr32[3];
> +
> +	sum4  = csum;
> +	sum4 += ulen;
> +	sum4 += uproto;
> +
> +	sum  += sum1;
> +	sum2 += sum3;
> +
> +	sum += sum2;
> +	sum += sum4;

Have you got gcc to compile that as-is?

Whenever I've tried to get a 'tree add' compiled so that the
early adds can be executed in parallel gcc always pessimises
it to a linear sequence of adds.

But I agree that adding 32bit values to a 64bit register
may be no slower than trying to do an 'add carry' sequence
that is guaranteed to only do one add/clock.
(And on Intel cpu from core-2 until IIRC Haswell adc took 2 clocks!)

IIRC RISCV doesn't have a carry flag, so the adc sequence
is hard - probably takes two extra instructions per value.
Although with parallel execute it may not matter.
Consider:
	val = buf[offset];
	sum += val;
	carry += sum < val;
	val = buf[offset1];
	sum += val;
	...
the compare and 'carry +=' can be executed at the same time
as the following two instructions.
You do then a final sum += carry; sum += sum < carry;

Assuming all instructions are 1 clock and any read delays
get filled with other instructions (by source or hardware
instruction re-ordering) even without parallel execute
that is 4 clocks for 64 bits, which is much the same as the
2 clocks for 32 bits.

Remember that all the 32bit values can summed first as
they won't overflow.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)