lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date: Wed, 12 Feb 2014 15:27:11 -0500
From: Bill Cox <>
Subject: Re: [PHC] Interleaved attack on cache-timeing-resistant KDFs

On Wed, Feb 12, 2014 at 12:14 PM, Solar Designer <> wrote:
> On Tue, Feb 11, 2014 at 05:13:20PM -0500, Bill Cox wrote:
>> It is far
>> more important to hammer the memory bandwidth, so a fast hash
>> function, and running hash functions in parallel are the important
>> things, as these increase memory bandwidth.
> These are important, yes.  Sequential multiplies are also important.
> Ideally, a defensive implementation would use both memory bandwidth and
> multiplies at nearly the maximum speed the machine can support.

Good point, memory is tied up longer if the computation loop cannot be
accelerated.  On an FPGA, multiplication is *much* slower than on my
PC or an advanced ASIC.  For example, this guy is pretty frustrated,
and he's just trying to get the multiply to run at 125MHz:

and this guy only gets 179 MHz.  I'd be surprised to see a 32x32-->32
multiply on an FPGA in under 4X longer than the fastest CPUs.  If an
password cracker adds cores to fill his bandwidth, this slower speed
will mean he has to have a lot more memory.

I feel better now :-)   Multiplication compute time hardening is
clearly useful against FPGAs.

So... I happen to be one of the best people on the planet for
architecting a low cost SRAM based FPGA-like thingys with one high
speed GDDR5 port and a bunch of 32-bit ALUs (including a fast
multiplier) for cheaply guessing NeolKDF, Scrypt, or other memory-hard
password hashes.  Configurable logic architectures and software for
configuring them is sort of my thing.  Regardless of who wins the PHC,
do you think I could get the NSA to pay me a lot of money to build
them low-cost, high-speed guessing chips?  :-p

At a cost of around $100 to generate one guess per second, I don't
know if even the NSA will bother.

>> Consider the case of a KDF that has a slow but cryptographically
>> secure hashing function in its inner loop, which generally runs in
>> cache rather than out of external DRAM.  The thinking is that
>> read/writes of small 64-byte hashes randomly is inefficient for
>> external cheap DRAM, but quite efficient for cache.  Instead of having
>> several MiB of expensive on-chip cache for every guessing core, an
>> attacker simply interleaves the memory for many guesses in parallel.
>> For example, if the KDF does 64 bytes of memory read/write at a time
>> to 10MiB of on-chip cache, then an attacker can instead interleave
>> memory for 256 guesses in parallel into 16KiB blocks of memory and
>> store them in cheap external DRAM.  There would be 16KiB read from
>> external DRAM every time the KDF would normally read a 64-byte hash
>> from cache, so we stream memory in/out efficiently at near max speed.
> Right.
> However, note that we know from Litecoin that 128-byte lookups from
> external RAM on GPU are actually quite efficient - as long as you can
> have lots of them pending (like a GPU can), to hide the latency.
> 64-byte is not that much smaller.

That's good to know.  This is yet another reason to go for high memory
bandwidth with a Catena-style KDF.

> Of course, Litecoin's 128 KB (and it's 64 KB with 2x TMTO, which the
> miners use on GPU) is very little, which is what permits a lot of
> instances to fit in GPU card's RAM and, in turn, permits for a lot of
> pending lookups.  At YACoin's current 4 MB, the speed ends up being
> CPU-like, and we'll see what it'll be with 8 MB soon.

Only 64KB!  Well, I guess I can't argue with success.  Litecoin seems
to be going well, but that choice seems almost designed for graphics

> So tying up significant amounts of memory per instance is important, for
> multiple reasons.
>> This means cache-timing-attach immune KDFs need as simple a hash
>> function as possible, and maybe they should do some hashing in
>> parallel, with SIMD and/or multiple threads to get their memory
>> bandwidth higher.  This is the best way to defend against brute-force
>> guessing attacks.

We need a better phrase than "cache-timing-attack immune" KDFs.
That's a mouthful.  In my docs, I'm using "pure" KDF to mean no
password dependent addressing, "dirty" to mean just do what you want
with addressing, and "hybrid" to describe KDFs like Scrypt which do
"pure" hashing followed by "dirty" hashing.  My terms kind of suck,
but it would be nice to agree on some simple terms for these different
kinds of memory-hard KDFs.

> I agree, but throwing in some sequential multiplies is also helpful, as
> long as this doesn't reduce defender's memory bandwidth usage.
> Alexander

The original version of NoelKDF that had massive parallelism hashes
2GB in .305 seconds using 2 threads, with a memory bandwidth of
13GB/s.  I looked at the generated code, and its making nice use of
SSE automatically.  With the serial multiplications, it's taking 0.42
seconds with 2 threads, and the generated code has no SSE in the inner
loop.  That's 9.5GB/s.  This is a significant memory bandwidth
penalty, though it may hurt an attacker as much or more than me.  They
can always add more cores to fill their memory bandwidth, but they'll
have to keep buying more RAM to make that math work.

Thanks for pointing out the error in my logic.  I simply had forgotten
to multiply the external memory by the number of parallel guessing
cores.  So... the NoelKDF hash function survives one more day,..


Powered by blists - more mailing lists