Date: Mon, 15 Sep 2014 18:49:24 -0400
From: Bill Cox <>
Subject: Re: [PHC] A review per day - Catena

On 09/15/2014 08:19 AM, Dmitry Khovratovich wrote:
> Catena-3 with 128 MB of memory would need 2^21 64-byte SHA blocks,
> so 2^23 compression function calls, i.e. 15mJ within 8/(316/65)=1.5
> seconds. Our tradeoffs have computational penalty 3.75q for the
> memory reduction by q, so, for instance, running it with 16 MB of
> memory would require 30x more energy, or 450mJ within the same 1.5
> seconds.
> The memory will consume far larger. The 128-MB Catena-3
> reads/writes 768 MB of data from/to RAM, and 1.5 GB in tradeoffs.
> If we consider GDDR5, advocated by Bill, and scale it down to 300
> MHz, it would at least consume 0.5W, so its energy consumption
> would be 750mJ, and 94mJ for 16MB.
> Therefore, the attacker will reduce his costs even if the memory is
> reduced 8-fold. And we have not counted the retention energy yet,
> as it will raise the tradeoff efficiency even more.

OK, so let's work with these numbers.  Your 128MiB is in external
DRAM, plain and simple.  Let's just say nice fast GDDR5 memory.  Every
access of 64 bytes comes with about a 10ns cache miss penalty, which
will dominate your runtime, so you need a plan to deal with it.
Fortunately Catena uses predictable addressing, and you can interleave
memory for multiple password guesses.  With no TMTO, if I have a total
of 16 1GiB external memory banks, that's 128 guesses in parallel,
which reduces the cache miss penalty from 10ns per 64 bytes to 10ns
per 8KiB, or about 78ps per block of 64 bytes.  That's a good number,
because transferring the 64 bytes will only take about 2ns.  Without
this strategy, you'll be suffering massive cache miss penalties.  This
ASIC system I assume consumes about 70W, half in the GDDR5 memory,
half on the ASIC.  That's half the power that my CPU in the same
process burns while computing hashes hundreds of times more slowly, so
I think I'm not being too hard on the ASIC by saying it consumes 35W.
Roughly, I'll guess that 1/3 is in on-chip memory, 1/3 in computation,
and 1/3 in the I/O ring.
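
As a sanity check on the interleaving arithmetic above, here is a minimal Python sketch. The variable names are only illustrative, and the 10ns miss penalty is the rough figure assumed in this message, not a measured value.

```python
# Back-of-the-envelope check of the interleaving numbers in this message.
# All inputs are the rough figures quoted above, not measurements.
GIB = 1 << 30
MIB = 1 << 20

total_memory = 16 * GIB        # 16 external banks of 1 GiB each
memory_per_guess = 128 * MIB   # one Catena-3 instance with no TMTO
block_bytes = 64               # size of one memory access
miss_penalty_ns = 10.0         # assumed cache miss penalty per access

# Number of password guesses whose memory fits side by side.
parallel_guesses = total_memory // memory_per_guess

# Interleaving one 64-byte block from each guess gives one long burst.
burst_bytes = parallel_guesses * block_bytes

# The single miss penalty is shared by every block in the burst.
penalty_per_block_ps = miss_penalty_ns * 1000 / parallel_guesses

print(parallel_guesses)      # 128
print(burst_bytes)           # 8192, i.e. 8 KiB
print(penalty_per_block_ps)  # 78.125
```

The amortized penalty of about 78ps is small next to the roughly 2ns needed to transfer the 64 bytes, which is the point of interleaving.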

Now, if you do an 8X reduction in memory for a 30X recomputation
penalty, and you're still in external GDDR5 DRAM, you're dead.  You
started within 2X of being memory bandwidth limited, so you're going
to run at least 15X slower, for an 8X memory reduction.  That external
GDDR5 memory may not consume 35W anymore, but since we still have 16
high bandwidth ports running at full steam, good luck getting it below
25W.  Clearly, this TMTO leads to higher power, not lower.
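
The slowdown estimate above can be sketched the same way. This assumes, as the argument does, that memory traffic grows in proportion to the recomputation penalty; the time*memory ratio at the end is my own extension of the numbers, not something stated in Dmitry's message.

```python
# Rough model of the external-DRAM TMTO case described above.
recompute_penalty = 30    # 30X recomputation for an 8X memory reduction
memory_reduction = 8
bandwidth_headroom = 2    # baseline is within 2X of being bandwidth limited

# Assumption: memory traffic scales with the recomputation penalty, so
# only the 2X headroom can be absorbed before bandwidth limits the run.
min_slowdown = recompute_penalty / bandwidth_headroom

# Time*memory product relative to the no-TMTO baseline (above 1 is worse).
time_memory_ratio = min_slowdown / memory_reduction

print(min_slowdown)       # 15.0
print(time_memory_ratio)  # 1.875
```

Under these assumptions the attacker's time*memory cost goes up rather than down, before power is even considered.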

Let's suppose a 16-to-1 TMTO with a 30X recomputation penalty allows
you to fit in 16MiB of on-chip cache.  This is twice the cache I have,
but not the largest ever built by Intel.  If you don't want to run
slower than before, let's assume you can have 30X free processing
cores that consume no power.

Because of predictable addressing, you can pipeline the snot out of
your cache, and you may be able to actually design a 30-port version
to deliver the data you need when you need it.  With fewer ports,
you're going to run slower than the version with external GDDR5
memory.  We'll assume your 30-port 16MiB cache runs effectively at
infinite speed, because of the pipelining.  So, now you compute hashes
just as fast, with a 16X TMTO, with a 16X time*memory benefit.

However, the power per read/write of this cache ram is quite a bit
*higher* than my Intel processor's.  Assuming around 1/3 of its 70W
is for on-chip memory access, your cache would be running at about
350W!  That's game over.

Do you really think you can design a 5-10 TiB 30-port pipelined 16MiB
cache that consumes less power than this?
