Date: Mon, 15 Sep 2014 18:49:24 -0400
From: Bill Cox <waywardgeek@...hershed.org>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] A review per day - Catena

On 09/15/2014 08:19 AM, Dmitry Khovratovich wrote:
> Catena-3 with 128 MB of memory would need 2^21 64-byte SHA blocks,
> so 2^23 compression function calls, i.e. 15mJ within 8/(316/65)=1.5
> seconds. Our tradeoffs have computational penalty 3.75q for the
> memory reduction by q, so, for instance, running it with 16 MB of
> memory would require 30x more energy, or 450mJ within the same 1.5
> seconds.
> 
> The memory will consume far more. The 128-MB Catena-3
> reads/writes 768 MB of data from/to RAM, and 1.5 GB in tradeoffs.
> If we consider GDDR5, advocated by Bill, and scale it down to 300
> MHz, it would at least consume 0.5W, so its energy consumption
> would be 750mJ, and 94mJ for 16MB.
> 
> Therefore, the attacker will reduce his costs even if the memory is
> reduced 8-fold. And we have not counted the retention energy yet,
> as it will raise the tradeoff efficiency even more.
> 
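
A quick sanity check of the quoted arithmetic (my own back-of-envelope; the factor of 4 between blocks and compression calls is what Dmitry's numbers imply):

```python
# Check the quoted Catena-3 figures.
blocks = (128 * 2**20) // 64   # 128 MB in 64-byte SHA blocks
print(blocks == 2**21)         # True

# ~4 compression calls per block is implied by 2^23 / 2^21
calls = blocks * 4
print(calls == 2**23)          # True

# Quoted tradeoff: computational penalty 3.75*q for memory reduced by q
q = 8                          # 128 MB -> 16 MB
penalty = 3.75 * q
print(penalty)                 # 30.0 -> the quoted 30x energy factor
```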

OK, so let's work with these numbers.  Your 128MiB is in external
DRAM, plain and simple.  Let's just say nice fast GDDR5 memory.  Every
access of 64 bytes comes with about a 10ns cache miss penalty, which
will dominate your runtime, so you need a plan to deal with it.
Fortunately Catena uses predictable addressing, and you can interleave
memory for multiple password guesses.  With no TMTO, if I have a total
of 16 1GiB external memory banks, that's 128 guesses in parallel,
which reduces the cache miss penalty from 10ns per 64 bytes to 10ns
per 8KiB, or about 78ps per block of 64 bytes.  That's a good number,
because transferring the 64 bytes will only take about 2ns.  Without
this strategy, you'll be suffering massive cache miss penalties.  I
assume this ASIC system consumes about 70W: half in the GDDR5 memory,
half on the ASIC.  That's half the power my CPU in the same process
burns while computing hashes hundreds of times more slowly, so I don't
think I'm being too hard on the ASIC by saying it consumes 35W.
Roughly, I'll guess that 1/3 goes to on-chip memory, 1/3 to
computation, and 1/3 to the I/O ring.
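The interleaving arithmetic above is easy to verify (my numbers, assuming the 10ns miss penalty and 16 banks stated above):

```python
# Back-of-envelope check of the interleaved-guess numbers.
MIB = 1 << 20
GIB = 1 << 30

total_mem = 16 * GIB                 # 16 external banks of 1 GiB each
mem_per_guess = 128 * MIB            # one Catena-3 instance
guesses = total_mem // mem_per_guess
print(guesses)                       # 128 parallel guesses

burst_bytes = guesses * 64           # one 64-byte block per guess
print(burst_bytes // 1024)           # 8 -> an 8 KiB burst per miss

miss_penalty_ns = 10.0
penalty_per_block_ps = miss_penalty_ns * 1000 / guesses
print(penalty_per_block_ps)          # ~78 ps per 64-byte block
```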

Now, if you do an 8X reduction in memory for a 30X recomputation
penalty, and you're still in external GDDR5 DRAM, you're dead.  You
started within 2X of being memory bandwidth limited, so you're going
to run at least 15X slower, for an 8X memory reduction.  That external
GDDR5 memory may not consume 35W anymore, but since we still have 16
high bandwidth ports running at full steam, good luck getting it below
25W.  Clearly, this TMTO leads to higher power, not lower.
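
The slowdown estimate is just the recomputation factor divided by the bandwidth headroom (my framing of the argument above):

```python
# Why an 8x memory reduction in external DRAM can't pay off.
recompute_penalty = 30   # extra compression calls, from the quoted tradeoff
bandwidth_headroom = 2   # we started within 2x of the bandwidth limit
slowdown = recompute_penalty / bandwidth_headroom
print(slowdown)          # 15.0 -> at least 15x slower for only 8x less memory
```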

Let's suppose a 16-to-1 TMTO with a 30X recomputation penalty allows
you to fit in 16MiB of on-chip cache.  This is twice the cache I have,
but not the largest ever built by Intel.  If you don't want to run
slower than before, let's assume you can have 30X free processing
cores that consume no power.

Because of predictable addressing, you can pipeline the snot out of
your cache, and you may actually be able to design a 30-port version.
With fewer ports, you won't be able to deliver the data when you need
it, so you're going to run slower than the version with external GDDR5
memory.  We'll assume your 30-port 16MiB cache runs effectively at
infinite speed, because of the pipelining.  So, now you compute hashes
just as fast, with a 16X TMTO, with a 16X time*memory benefit.

However, the power per read/write of this cache RAM is quite a bit
*higher* than my Intel processor's.  Assuming around 1/3 of its 70W
is for on-chip memory access, your cache would be running at about
350W!  That's game over.
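
One way to reproduce the ~350W figure (my reading: take the 35W ASIC's cache share from above and scale it by the 30X recomputation rate):

```python
# Sketch of the cache-power estimate, under my reading of the numbers above.
asic_power_w = 35.0            # the ASIC half of the 70 W baseline system
cache_share = 1.0 / 3.0        # guessed fraction spent on on-chip memory
baseline_cache_w = asic_power_w * cache_share   # ~11.7 W at the original rate

recompute = 30                 # 30x more cache reads for the TMTO, same hash rate
tmto_cache_w = baseline_cache_w * recompute
print(round(tmto_cache_w))     # ~350 W -> the cache alone dwarfs the baseline
```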

Do you really think you can design a 5-10 TiB 30-port pipelined 16MiB
cache that consumes less power than this?

Bill
