linux-kernel - RE: [PATCH 3/3] riscv: optimized memset

Message-ID: <26a7af6f33fa440f986adb4d690f47dc@AcuMS.aculab.com>
Date: Thu, 1 Feb 2024 23:04:48 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Nick Kossifidis' <mick@....forth.gr>, Jisheng Zhang <jszhang@...nel.org>,
	Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt
	<palmer@...belt.com>, Albert Ou <aou@...s.berkeley.edu>
CC: "linux-riscv@...ts.infradead.org" <linux-riscv@...ts.infradead.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Matteo Croce
	<mcroce@...rosoft.com>
Subject: RE: [PATCH 3/3] riscv: optimized memset

...
> > +		/* Compose an ulong with 'c' repeated 4/8 times */
> > +#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
> > +		cu *= 0x0101010101010101UL;

That is likely to generate a compile error on 32bit.
Maybe:
		cu *= (unsigned long)0x0101010101010101ULL;
> > +#else
> > +		cu |= cu << 8;
> > +		cu |= cu << 16;
> > +		/* Suppress warning on 32 bit machines */
> > +		cu |= (cu << 16) << 16;
> > +#endif
> 
> I guess you could check against __SIZEOF_LONG__ here.

Or even sizeof (cu), possibly as:
		cu |= cu << (sizeof (cu) == 8 ? 32 : 0);
which I'm pretty sure a modern compiler will throw away for 32bit.
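
Putting the two suggestions side by side, as a minimal sketch only: the function names repeat_byte_mul() and repeat_byte_shift() are made up for illustration and are not part of the patch. The cast truncates the 64-bit constant to the width of unsigned long, and the sizeof test keeps the last shift count in range on both 32-bit and 64-bit builds.

```c
/* Hypothetical helpers for illustration; not taken from the patch. */

/* Multiply variant: the cast truncates the ULL constant on 32-bit. */
unsigned long repeat_byte_mul(unsigned char c)
{
	unsigned long cu = c;

	cu *= (unsigned long)0x0101010101010101ULL;
	return cu;
}

/* Shift-and-or variant: sizeof selects the final shift count. */
unsigned long repeat_byte_shift(unsigned char c)
{
	unsigned long cu = c;

	cu |= cu << 8;
	cu |= cu << 16;
	/* Count is 32 only when long is 8 bytes, otherwise 0 (a no-op). */
	cu |= cu << (sizeof(cu) == 8 ? 32 : 0);
	return cu;
}
```

Both should return the same value for any byte: every byte of the result equals 'c', whatever the width of unsigned long.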

I do wonder whether CONFIG_ARCH_HAS_FAST_MULTIPLIER is worth
testing - you'd really want to know there is a RISC-V CPU
with a multiply that is slower than the shift-and-or version.
I actually doubt it.
Multiply is used so often (all array indexing) that you
really do need something better than a '1 bit per clock' loop.
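
For reference, a '1 bit per clock' multiplier behaves roughly like the loop below, written here as a software sketch (mul_shift_add() is a made-up name, and real iterative hardware differs in detail). Each iteration consumes one bit of the multiplier, so a 64-bit multiply can take up to 64 steps.

```c
/* Hypothetical model: one multiplier bit handled per iteration. */
unsigned long mul_shift_add(unsigned long a, unsigned long b)
{
	unsigned long product = 0;

	while (b) {
		if (b & 1)
			product += a;	/* add the shifted multiplicand */
		a <<= 1;		/* next bit has twice the weight */
		b >>= 1;
	}
	return product;
}
```

The result is the product truncated to the width of unsigned long, the same as the C '*' operator on that type.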

It is worth remembering that you can implement an n*n multiply
with n*n 'full adders' (3 input bits, 2 output bits) with a
latency of 2*n adders.
So the latency is only twice that of the corresponding add.
For a modern chip that is not much logic at all.
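
One way to picture that, as a rough software model rather than a description of any real RISC-V core: mul_carry_save() below is a made-up name, with n fixed at 32. Each loop iteration acts as one row of independent full adders, so a row costs a single adder delay because no carry propagates along it. Carries only ripple in the final add. That gives about n rows plus about n bits of ripple, which is where a figure of around 2*n adder delays comes from.

```c
#include <stdint.h>

/*
 * Hypothetical model of a carry-save array multiplier (32x32 -> 64).
 * The bitwise operations stand in for a row of full adders:
 * 3 input bits (sum, carry, partial product) -> sum bit + carry bit.
 */
uint64_t mul_carry_save(uint32_t a, uint32_t b)
{
	uint64_t sum = 0, carry = 0;

	for (int i = 0; i < 32; i++) {
		/* Partial product for bit i of b: 'a' shifted, or zero. */
		uint64_t pp = ((b >> i) & 1) ? (uint64_t)a << i : 0;
		uint64_t s = sum ^ carry ^ pp;
		uint64_t c = (sum & carry) | (sum & pp) | (carry & pp);

		sum = s;
		carry = c << 1;	/* carries feed the next-higher column */
	}
	/* Final carry-propagate add: the only place carries ripple. */
	return sum + carry;
}
```

The invariant is that sum + carry always equals the total of the partial products added so far, so the final add yields a * b.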

	David
