Message-ID: <05ae373684334e6581294baa8afd3238@AcuMS.aculab.com>
Date:   Fri, 14 Jan 2022 17:27:43 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     "'Jason A. Donenfeld'" <Jason@...c4.com>,
        Geert Uytterhoeven <geert@...ux-m68k.org>
CC:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Theodore Tso <tytso@....edu>,
        Greg KH <gregkh@...uxfoundation.org>,
        Jean-Philippe Aumasson <jeanphilippe.aumasson@...il.com>
Subject: RE: [PATCH v2 2/2] random: use BLAKE2s instead of SHA1 in extraction

From: Jason A. Donenfeld
> Sent: 11 January 2022 12:50
>
> On Tue, Jan 11, 2022 at 1:28 PM Jason A. Donenfeld <Jason@...c4.com> wrote:
> > If you're really quite concerned about m68k code size, I can probably
> > do some things to reduce that. For example, blake2s256_hmac is only
> > used by wireguard and it could probably be made local there. And with
> > some trivial loop re-rolling, I can shave off another 2300 bytes. And
> > I bet I can find a few other things too. The question is: how
> > important is this to you?
> 
> And with another trick (see below), another extra 1000 bytes or so
> shaved off. Aside from moving blake2s256_hmac, I'm not really super
> enthusiastic about making these changes, but depending on how important
> this is to you, maybe we can make something work. There are probably
> additional possibilities too with the code.

Quite clearly whoever wrote the unrolled loops needs their head examined.
It is extremely unlikely that a CPU has enough registers to implement it
efficiently.
(Of course, a pipelined implementation on an FPGA is another matter.)

So every read of v[] is going to be a memory read.
Much better to do that than to spill values that change.
The memory reads won't really hit performance either.
They add a bit of latency - but that will be handled by
instruction scheduling - either by the compiler or by the CPU hardware.

> -#define ROUND(r) do { \
> -	G(r, 0, v[0], v[ 4], v[ 8], v[12]); \
> -	G(r, 1, v[1], v[ 5], v[ 9], v[13]); \
> -	G(r, 2, v[2], v[ 6], v[10], v[14]); \
> -	G(r, 3, v[3], v[ 7], v[11], v[15]); \
> -	G(r, 4, v[0], v[ 5], v[10], v[15]); \
> -	G(r, 5, v[1], v[ 6], v[11], v[12]); \
> -	G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
> -	G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
> -} while (0)
> -		ROUND(0);
> -		ROUND(1);
> -		ROUND(2);
> -		ROUND(3);
> -		ROUND(4);
> -		ROUND(5);
> -		ROUND(6);
> -		ROUND(7);
> -		ROUND(8);
> -		ROUND(9);

The v[] indices clearly don't change from round to round in the above.
Use 4 separate arrays so you have:

#define ROUND(r) do { \
	G(r, 0, v[0], w[0], x[0], y[0]); \
	G(r, 1, v[1], w[1], x[1], y[1]); \
	G(r, 2, v[2], w[2], x[2], y[2]); \
	G(r, 3, v[3], w[3], x[3], y[3]); \
	G(r, 4, v[0], w[1], x[2], y[3]); \
	G(r, 5, v[1], w[2], x[3], y[0]); \
	G(r, 6, v[2], w[3], x[0], y[1]); \
	G(r, 7, v[3], w[0], x[1], y[2]); \
} while (0)
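
The diagonal half of that table is just a rotation of the column half:
G call 4 + j takes element j of v[], j + 1 of w[], j + 2 of x[] and
j + 3 of y[], all modulo 4. A minimal sketch that checks this against the
indices in the original unrolled ROUND(); diag_flat_index() and
check_diag_mapping() are hypothetical names used only for illustration,
and v/w/x/y are assumed to be the four consecutive quarters of the
original v[16]:

```c
#include <assert.h>

/*
 * Hypothetical helper: index into the original flat v[16] used by
 * G call number 4 + j (j = 0..3) for a given row, where row 0 is v[],
 * row 1 is w[], row 2 is x[] and row 3 is y[].
 */
static int diag_flat_index(int j, int row)
{
	return 4 * row + ((j + row) & 3);
}

/* Compare the rotation with the indices of the original ROUND() macro. */
static void check_diag_mapping(void)
{
	static const int expect[4][4] = {
		{ 0, 5, 10, 15 },
		{ 1, 6, 11, 12 },
		{ 2, 7,  8, 13 },
		{ 3, 4,  9, 14 },
	};

	for (int j = 0; j < 4; j++)
		for (int row = 0; row < 4; row++)
			assert(diag_flat_index(j, row) == expect[j][row]);
}
```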

Now double the size of each of the v/w/x/y arrays and write the correct
values when they are created/updated and you get:

#define ROUND(r) do { \
	G(r, 0, v[0], w[0], x[0], y[0]); \
	G(r, 1, v[1], w[1], x[1], y[1]); \
	G(r, 2, v[2], w[2], x[2], y[2]); \
	G(r, 3, v[3], w[3], x[3], y[3]); \
	G(r, 4, v[4], w[4], x[4], y[4]); \
	G(r, 5, v[5], w[5], x[5], y[5]); \
	G(r, 6, v[6], w[6], x[6], y[6]); \
	G(r, 7, v[7], w[7], x[7], y[7]); \
} while (0)

Oh - that is a nice loop...
So we get:
	for (r = 0; r < 10; r++)
		for (j = 0; j < 8; j++)
			G(r, j, v[j], w[j], x[j], y[j]);

Which is likely to be just as fast as any other version.
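
For comparison, here is a minimal standalone sketch of a re-rolled round
loop. It is not the kernel code: g(), idx[], rounds_rolled(),
rounds_unrolled() and check_rounds_match() are hypothetical names. Because
G in BLAKE2s (RFC 7693) updates all four of its state words in place, the
sketch assumes a single v[16] addressed through a small index table instead
of duplicated arrays, and it only checks that the looped form computes the
same result as the unrolled one. It says nothing about speed or code size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* BLAKE2s message word schedule (RFC 7693). */
static const uint8_t sigma[10][16] = {
	{  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
	{ 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
	{ 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 },
	{  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 },
	{  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 },
	{  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 },
	{ 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 },
	{ 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 },
	{  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 },
	{ 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13,  0 },
};

/* State word indices for the 8 G calls: 4 columns, then 4 diagonals. */
static const uint8_t idx[8][4] = {
	{ 0, 4,  8, 12 }, { 1, 5,  9, 13 }, { 2, 6, 10, 14 }, { 3, 7, 11, 15 },
	{ 0, 5, 10, 15 }, { 1, 6, 11, 12 }, { 2, 7,  8, 13 }, { 3, 4,  9, 14 },
};

static uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* One G mixing step; all four state words are read and written. */
static void g(uint32_t v[16], int a, int b, int c, int d,
	      uint32_t x, uint32_t y)
{
	v[a] += v[b] + x; v[d] = ror32(v[d] ^ v[a], 16);
	v[c] += v[d];     v[b] = ror32(v[b] ^ v[c], 12);
	v[a] += v[b] + y; v[d] = ror32(v[d] ^ v[a], 8);
	v[c] += v[d];     v[b] = ror32(v[b] ^ v[c], 7);
}

/* Re-rolled form: two nested loops driven by the index table. */
static void rounds_rolled(uint32_t v[16], const uint32_t m[16])
{
	for (int r = 0; r < 10; r++)
		for (int j = 0; j < 8; j++)
			g(v, idx[j][0], idx[j][1], idx[j][2], idx[j][3],
			  m[sigma[r][2 * j]], m[sigma[r][2 * j + 1]]);
}

/* Reference form: the 8 G calls of each round written out by hand. */
static void rounds_unrolled(uint32_t v[16], const uint32_t m[16])
{
	for (int r = 0; r < 10; r++) {
		const uint8_t *s = sigma[r];

		g(v, 0, 4,  8, 12, m[s[0]],  m[s[1]]);
		g(v, 1, 5,  9, 13, m[s[2]],  m[s[3]]);
		g(v, 2, 6, 10, 14, m[s[4]],  m[s[5]]);
		g(v, 3, 7, 11, 15, m[s[6]],  m[s[7]]);
		g(v, 0, 5, 10, 15, m[s[8]],  m[s[9]]);
		g(v, 1, 6, 11, 12, m[s[10]], m[s[11]]);
		g(v, 2, 7,  8, 13, m[s[12]], m[s[13]]);
		g(v, 3, 4,  9, 14, m[s[14]], m[s[15]]);
	}
}

/* Run both forms on the same arbitrary input and compare the state. */
static void check_rounds_match(void)
{
	uint32_t m[16], a[16], b[16];

	for (int i = 0; i < 16; i++) {
		m[i] = 0x9e3779b9u * (uint32_t)(i + 1);
		a[i] = b[i] = 0x01234567u + (uint32_t)i;
	}
	rounds_rolled(a, m);
	rounds_unrolled(b, m);
	assert(memcmp(a, b, sizeof(a)) == 0);
}
```

The index table costs 32 bytes of read-only data and four extra byte loads
per G call; whether that is a good trade on a given CPU would have to be
measured.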

You might need to give the compiler some great big hints
in order to get sensible code.
Possibly make v[], w[], x[] and y[] all volatile and replace
the inner loop body with:
			v_j = v[j]; w_j = w[j]; x_j = x[j]; y_j = y[j];
			G(r, j, v_j, w_j, x_j, y_j);

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
