lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sun, 12 Jan 2014 10:44:41 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] escrypt memory access speed (Re: [PHC] Reworked KDF
 available on github for feedback: NOELKDF)

On Sun, Jan 12, 2014 at 5:02 AM, Solar Designer <solar@...nwall.com> wrote:
> On Sat, Jan 11, 2014 at 04:42:10PM -0500, Bill Cox wrote:
> Having less work per pipeline stage
> helps keep all logic busy: by the time the signal has propagated, we
> take it to the next stage and reuse the current stage's logic for the
> next set of inputs.  Now, I admit we can't benefit from a pipeline when
> all we have is one instance of Salsa20/N and our (attacker's) goal is
> solely to minimize latency.  But even then, I don't see why having a
> large number of rounds in one clock cycle would substantially reduce the
> total latency, as compared to having only a few rounds per cycle.
> I think the latency in ns should be similar regardless of how this is
> structured - well, perhaps only slightly lower (like by 10%) with more
> rounds per clock cycle.  Am I wrong?
>
> I think the added latency from having (reasonably) more computation may
> increase the AT cost, by tying up the memory for longer.  This reduces
> our reliance on attackers' memory bandwidth being limited.
>
> Alexander

I haven't looked carefully enough at SHA-256 to know how fast it could
be in hardware.  Actually, I'm more of a place and route guy than a
digital synthesis guy, so my opinions are ball-park.

However, I have looked at Salsa/8.  I know you know it by heart, but
it helps for me to see it:

        for (i = 0; i < 8; i += 2) {
#define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
                /* Operate on columns. */
                x[ 4] ^= R(x[ 0]+x[12], 7);  x[ 8] ^= R(x[ 4]+x[ 0], 9);
                x[12] ^= R(x[ 8]+x[ 4],13);  x[ 0] ^= R(x[12]+x[ 8],18);

                x[ 9] ^= R(x[ 5]+x[ 1], 7);  x[13] ^= R(x[ 9]+x[ 5], 9);
                x[ 1] ^= R(x[13]+x[ 9],13);  x[ 5] ^= R(x[ 1]+x[13],18);

                x[14] ^= R(x[10]+x[ 6], 7);  x[ 2] ^= R(x[14]+x[10], 9);
                x[ 6] ^= R(x[ 2]+x[14],13);  x[10] ^= R(x[ 6]+x[ 2],18);

                x[ 3] ^= R(x[15]+x[11], 7);  x[ 7] ^= R(x[ 3]+x[15], 9);
                x[11] ^= R(x[ 7]+x[ 3],13);  x[15] ^= R(x[11]+x[ 7],18);

                /* Operate on rows. */
                x[ 1] ^= R(x[ 0]+x[ 3], 7);  x[ 2] ^= R(x[ 1]+x[ 0], 9);
                x[ 3] ^= R(x[ 2]+x[ 1],13);  x[ 0] ^= R(x[ 3]+x[ 2],18);

                x[ 6] ^= R(x[ 5]+x[ 4], 7);  x[ 7] ^= R(x[ 6]+x[ 5], 9);
                x[ 4] ^= R(x[ 7]+x[ 6],13);  x[ 5] ^= R(x[ 4]+x[ 7],18);

                x[11] ^= R(x[10]+x[ 9], 7);  x[ 8] ^= R(x[11]+x[10], 9);
                x[ 9] ^= R(x[ 8]+x[11],13);  x[10] ^= R(x[ 9]+x[ 8],18);

                x[12] ^= R(x[15]+x[14], 7);  x[13] ^= R(x[12]+x[15], 9);
                x[14] ^= R(x[13]+x[12],13);  x[15] ^= R(x[14]+x[13],18);
#undef R
        }

This look is executed 4 times for Sasla20/8.  Tracing the data path
for x[0], I see a depth of 4 32-bit additions and 4 32-bit XORs, per
loop, and there are 4 loops for a total depth of 16 add/xor stages.

This does not look as challenging to compute as a 32x32 multiply.  An
Intel CPU does a 64x64 multiply in 3 clocks.  This should be doable as
fast, I think.

So, hand-optimized Salsa20/8 in 20nm is maybe is 3 clocks at 3.5GHz.
That's my best guess.  A multiplier designer could probably be more
accurate.

Bill

Powered by blists - more mailing lists