Message-ID: <87r4agmaan.fsf@wolfjaw.dfranke.us>
Date: Sat, 16 Nov 2013 14:11:28 -0500
From: Daniel Franke <dfoxfranke@...il.com>
To: Solar Designer <solar@...nwall.com>
Cc: discussions@...sword-hashing.net
Subject: Re: [PHC] The EARWORM password hash
Thanks a bunch for the feedback. There's a lot of it. I'm only getting
to about half of it in this reply.
Solar Designer <solar@...nwall.com> writes:
> Right. How do you propose the ROM be initialized? Is this within scope
> of EARWORM itself or totally external to it?
The way I do it for my test vectors is to generate an AES-CTR keystream
from the key "dontusethiskeyinproduction". What I have in mind for
production use is to give the user a program that reads a key from
/dev/[u]random, generates an arena from it and writes it to disk, and
then prints the key to stdout with instructions to write it down and
store it offline. The user can then regenerate or enlarge the arena by
resupplying the key. This will all be discussed in my submission
document, but it will appear in the spec proper only insofar as the test
vectors are concerned; the rest will be relegated to the "Usage
Considerations" section.
I considered ideas similar to yours for speeding up initialization time,
but couldn't convince myself that anything I thought of was secure. I
eventually just gave up because I don't think most users are going to be
willing to use an arena large enough to make addressing the issue worth
the added complexity.
> The above has what we'd consider a flaw under our threat model (attackers
> with many smaller memory machines), but you might not (you mentioned
> using this with only 256 MiB anyway). scratchpad[] elements beyond
> index 0 are only needed for the final prf() computation, so instead of
> working on the full arena[] under each workunit, an implementation could
> efficiently work on portions of the (re-structured) arena[] on a machine
> with a smaller amount of fast memory (and swap those portions outside of
> the "for d from 0 to (WORKUNIT_DEPTH / 2)" loop).
I only used the 256MiB arena for the sake of GPU experimentation (my
7850 has a 512MiB contiguous-allocation limit). In practice, I'm
picturing sizes from 1-16 GiB. Trying to compete with attackers on the
total memory capacity of a single node isn't something I've pursued, but
if somebody convinces me that it's worth doing, the tweak needed to
defeat this attack is straightforward: just flatten the array and allow
chunks to start at any block.
> Have you experimented with such prefetching (using the corresponding CPU
> instructions)? Similarly, have you tried removing the second index
> variable (and thus the ability to prefetch this much in advance) - does
> it affect performance on current CPUs or does it not matter (yet)?
Yes, my optimized implementation contains prefetch instructions. I get
about a 5% performance hit if I comment them out.