phc-discussions - Re: [PHC] Weakness in keystretch hashing, and a solution (I think)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date: Sun, 5 Jan 2014 02:53:50 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Weakness in keystretch hashing, and a solution (I think)

On Sat, Jan 04, 2014 at 05:45:36AM -0500, Bill Cox wrote:
> On Sat, Jan 4, 2014 at 3:09 AM, Solar Designer <solar@...nwall.com> wrote:
> 
> > How did you arrive at log2(numPages)?  Is this assuming that the
> > attacker keeps a certain number of full pages (perhaps some function of
> > numPages) and thus only recomputes the missing random pages until
> > finally hitting a present one?  If so, why don't you count those pages
> > towards attacker's memory usage?
> 
> Here I assumed an attacker only keeps the 64-byte key data, and no RAM data
> at all.  To generate the next page of data, I randomly pick a previous
> page, which on average is 1/2 the way back through memory.  Since that page
> is not available, I have to generate it, again jumping on average half way
> back.  This converges to an average of log2(N), where I'm generating the
> Nth page.

Sounds right.

> > > The simplest fix I see is to increase the state used to compute pages to
> > > the page length or more (scrypt uses 2X).
> >
> > Where do you see scrypt using 2x?  The size of "X" in scrypt is the same
> > as its block size.
> 
> The code I read had an X and Y, each being 1 KB, so the majority of the
> state in scrypt's RNG seems to be 2KB.  It's block size is 1KB.  Wouldn't I
> have to store both X and Y to reproduce the output from a specific point?

No, only X is needed.  Having X and Y is an optimization in some of the
implementations (uses of X and Y are interleaved in a 2x unrolled loop
in order to avoid unnecessary data copying).

BTW, increasing the state size beyond the block size is a way to
discourage TMTO.  Then whatever blocks the attacker chooses not to store
will require for recomputation not only other blocks, but also more of
the normally-not-stored state.  IIRC, I first arrived at this thought
when considering adding a Blowfish-like construct with variable S-boxes
(primarily as an anti-GPU measure) that would keep state across
invocations of BlockMix.  The size of such S-boxes would be a few KB,
and usually higher than the 128*r block size (e.g. for r=8).  So I think
we may have this as an optional feature, to be enabled when neither TMTO
nor GPU friendliness are desirable for the defender.  Drawbacks:
complexity, too many knobs.

> > > Some thoughts: later pages are accessed less than early pages, but why
> > > do I care?  If the attacker has to keep later pages in RAM, we've won.
> >
> > This makes sense, but the attacker may have multiple types of RAM (e.g.
> > local to the cores vs. far away).  For later pages that are less likely
> > to ever be read, they may be sent to the far-away memory, thereby
> > leaving more of the very limited area close to the cores for use for
> > portions that are more likely to be randomly read from.  This works well
> > if the address to physical memory location mapping is static (e.g.
> > hard-wired in the circuit).

Consider that first N/2 memory may be local and fast and second N/2 far
away and slow.  With your fill and hash approach, 7/8th of the reads
will be from the first N/2 and only 1/8th from the second N/2.  This is
because on iteration counts up to N/2, all of the reads are from the
first half, and then the ratio of reads from the first half linearly
decreases to 1/2, so it is on average 3/4 during the second N/2
iterations.  Averaging 1 and 3/4 again (to get the all time average), we
get the 7/8 figure.

So with a very simple circuit and no overhead - no address translation,
no caching, no data copying (swapping in/out) - the attacker is able to
use a lot slower memory for the second N/2 locations, while incurring
only very moderate performance impact overall.  For example, if the
hashing is memory bandwidth and/or latency bound, and the slower memory
is 2x slower than fast memory in these terms (whichever combination of
them is relevant), the total running time will be only 9/8 of that with
100% of fast memory.  So at 2/3 of the full memory speed, the running
time is increased to only 9/8.  This tradeoff is likely beneficial to
attackers with ASICs.

> > That said, you're right that requiring some relatively unlikely to be
> > read memory is better than not requiring it at all.  It's just that the
> > advantage from continuing to fill memory after time/2 is less than 2x,
> > and it is unclear where in the 1x to 2x range it is.  To make matters
> > worse, this issue arises within the first time/2 iterations as well, as
> > with your current approach accesses are non-uniform even within the
> > first half of the allocation and running time.  Thus, it is not clear
> > whether the total cost is even in the 1x to 2x range, or possibly below
> > 1x of scrypt's.  We avoid this in escrypt by ensuring uniform access
> > after time/2 (this gives us the 1x scrypt baseline), as well as closer
> > to uniform access before time/2 (this probably gives us nearly 1.33x
> > scrypt, but with no guarantee).
> 
> There is a significant advantage in generating pages that depend on a
> random previous page in terms of how much memory an attacker can eliminate
> without incurring a huge runtime.
> 
> The attack I have been thinking about is when the attacker wants to use
> N/2, if N is the number of memory locations.
[...]

Yes, TMTO needs to be considered too.

> > In other words, assuming same memory latency and cost (and other costs,
> > such as routing and access ports), your approach has 2x higher area*time
> > than scrypt, and escrypt's is exactly 4/3 of scrypt.  (For same running
> > time.)  However, with the unknowns considered (which may be affecting
> > the time factor via latency and the area factor via other aspects),
> > things become unclear, and escrypt's approach provides better assurance
> > (while possibly only 2/3 of NOELKDF's attack cost in many real-world
> > scenarios, I admit).
> >
> > Building upon scrypt, I wanted to stay strictly no worse than it rather
> > than merely likely no worse in many real-world scenarios.

> I suspect you're still thinking that the 2 loops in scrypt require twice
> the memory bandwidth as noelkdf.  I don't think that's true.

It's not true for the original scrypt, but it is for some modes in which
escrypt may run: when random lookups are enabled in first loop and
random writes are enabled in second loop.  The random writes are always
disabled when p>1, though, so for high p it's 1.5x the memory data
transfer size of NOELKDF per the same total allocation size.

> You are likely to succeed with escript.  I think trying to make a better
> scrypt rather than a whole new system will be an attractive to a lot of
> people.  Just the fact that you're doing that gives me enough confidence
> that this contest will find a decent solution that I hardly feel any need
> to participate, other than to have fun.  I was thinking of forking
> TrueCrypt to add scrypt, but I'd lean towards an escript enable version
> now.  You don't have to prove your security with escript, at least in
> compatibility mode.  I do with noelkdf or any other new system.  There's
> incredible value in time-tested algorithms in security.  They're just not
> as much fun :-)  Also, it's fun to do better.

Your participation is helpful in providing a KISS design and
implementation to use as a baseline - to see whether the higher
complexity of escrypt or whatever is justified.  This makes it
harder for me to compete, but that's great.

Alexander