phc-discussions - Re: [PHC] data-dependent branching (Re: [PHC] A little nit which bothers me...)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140403093904.GA30748@openwall.com>
Date: Thu, 3 Apr 2014 13:39:04 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] data-dependent branching (Re: [PHC] A little nit which bothers me...)

On Thu, Apr 03, 2014 at 10:20:05AM +0200, Ralf Zimmermann wrote:
> The idea of the data dependent sequence we aimed for in AntCrypt is to
> enforce the sequence (all functions in a row) but *without* the
> possibility of a pipelined version by only adding the pipeline
> registers: you have to execute all functions in parallel in every step
> and then continue with *one* of the results in *one* of the other
> functions.
> 
> This means, you cannot use an initial delay and afterwards receive the
> result every clock cycle, *unless* you unroll it and spend area for
> all functions in every pipeline step (please correct me if I am wrong,
> maybe I did not think this through correctly?).

I think you can pipeline the functions and receive their results every
clock cycle if you use the routing or buffering approach (number #4 and
number #3 on the list I had posted).  The real question (for the
attacker) is whether this is worth the overhead.  I think for
complicated functions, it is.  And for simple functions their die area
is negligible, and portions of them like multipliers may easily be
shared between them, as Bill pointed out.

In fact, now that I think of it that way, the sharing of e.g.
multipliers that Bill pointed out is not very different from "my"
routing approach.  It's a similar concept applied at a different level:
in Bill's example it'd be "within a core" (and would apply to portions
of functions) and in my example it'd be "across cores" (and would apply
to whole functions).  Looking at this from yet another angle it can be
said that with the routing approach we're re-defining "core" such that
it tests lots of candidate passwords at once using its pool of
resources.

> This does not matter if you use friendly functions like XOR, shift,
> etc., but we hope that using more complex but still CPU friendly
> functions (this is why we tried floating-point), the area consumption
> *may* hurt a little more.

Maybe, but how would you avoid the kind of sharing that Bill mentioned?
Can you come up with a large number of sufficiently different functions
that not much sharing could occur?  To me, even 16 feels too low, and I
doubt you can come up with 16 without allowing for shared multipliers,
etc., yet being CPU-friendly.

> @Bill: Of course you are correct and memory usage in memory-restricted
> environments may be a better way to annoy FPGAs. But as we primarily
> included the branching to slow down GPUs, we thought about how it may
> still add a little bit to FPGAs :)

I think slowing down GPUs is better done with bcrypt-like memory
accesses, but it's good to have some diversity in the approaches among
PHC submissions.

Here's a CPU attack on AntCrypt (as I understand it): define all
possible two-function groups.  For 16 original functions, this will
result in 256 dual-function functions, and if they're as tiny as they
currently are they will probably still fit in a CPU's L1i cache.  Now
you can define a dual-AntCrypt, testing two candidate passwords at a
time, and using the CPU core's parallel processing capabilities (SIMD,
multiple issue, super-scalar) it'll likely run no slower than a single
AntCrypt does.  So you get a 2x performance boost for the attacker.
Moreover, the cost of conditional branches is amortized, so the speedup
might be slightly higher than 2x.  (On PA-RISC, you could even go for 3x
(4096 triple-functions) or 4x (65536 quad-functions), but as Thomas
mentioned new PA-RISC systems are no longer sold.)

The above CPU attack may be combined with the buffering approach, to
improve locality of reference to the pieces of code if it exceeds L1i
cache by a small factor.  You can then e.g. split 4096 triple-functions
into 4 groups of 1024, and stall/buffer processing for individual
candidates until enough sequential triple-functions from within the same
group of 1024 are to be computed.  Then you go ahead and call those
triple-functions in their correct sequence (all hitting the same 1/4th
of the total code size), empty that one bucket of stalled/buffered
states, and proceed to stall/buffer more of them until the threshold is
once again reached for any of the four groups of 1024.  The question is
whether the state buffering would comfortably fit in some data cache or
RAM, without unacceptably slowing this down.  I guess there exist
algorithms where most of this data can be in L2 cache or above
(sequential addition and reading of buffered states), with only e.g. a
hash table and some counters in L1d.  But I could be wrong.  Either way,
this is only a relatively minor (in terms of benefit to attacker)
enhancement to the simple idea presented in the previous paragraph.

A way to mitigate these CPU attacks would be to define many more
individual functions.  Another way would be to include sufficient
parallelism in them (tunable to be barely enough for current and
very-near-future CPUs), so that the defender is not at a disadvantage.
These two approaches may be used together.  (Not that I like this whole
approach, but like I said some diversity in PHC is good anyway.)

Alexander