Date: Wed, 17 Sep 2014 14:52:39 -0400
From: Bill Cox <>
Subject: Re: [PHC] omegacrypt and timing

On 09/17/2014 02:12 PM, Brandon Enright wrote:
> On Wed, 17 Sep 2014 12:14:57 -0500 (CDT) Steve Thomas
> <> wrote:
>>> For the very first round of branching the GPU could run all 4
>>> in parallel and select the correct one. But, for the second
>>> round, the GPU would have to run branch 1 for each of the
>>> possible previous branches, and branch 2 for each of the
>>> possible branches, and so forth.
>>> The carry value is modified from start to finish through each 
>>> branch. There is no realistic way to perform the attack you
>>> describe because the number of previous branch paths grows
>>> exponentially. After just a few rounds you'd be doing an
>>> exponential amount of work.
>> When you run out of the ChaCha block data you conditionally
>> generate another. There is no exponential cost increase for
>> GPUs.
> Hi Steve, have you looked carefully at how the branching is done? 
> ChaCha is used for more than just choosing the branches.  It also 
> chooses which memory gets read and written.  Generating the ChaCha 
> stream is a non-issue.  You could pre-generate a few MB of it if
> you wanted.
> The issue is that after a round is run, depending on which branch
> was taken, a different amount of the ChaCha stream is consumed,
> which makes both the branch and the memory locations of the next
> round depend on how much of the ChaCha stream has been consumed.
> So it's not just a matter of running the next 4 branches in
> parallel and picking one because each branch has multiple different
> memory locations depending on where that branch falls in the
> stream.  The carry value would also be unpredictable.
> To avoid misalignment, if you ran all 4 for round 1, and then
> selected the right one, then all 4 for round 2, then selected the
> right one, etc., you'd be doing 4x as many memory operations and
> you'd need a way of discarding the memory changes made by the 3
> wrong branches.  Is this the attack you're suggesting?
> Brandon

Hi, Brandon.  Yes, I think this is exactly the attack Steve and I are
thinking of.  If you have 4 cases, that could result in 4X more
computation per GPU core, because each core has to execute every case.

GPU instructions can carry a predicate that says, in effect, "only
execute me if the zero flag is set".  This is used to implement
if-then-else statements where the branch depends on the zero flag,
for example.  It lets all the cores execute all the instructions in
both branches, with each core keeping only the result it needs.  I
think it applies very naturally to case statements.
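As a rough sketch (plain Python, with made-up branch functions and a
software "predicate" standing in for the hardware mask), lockstep
execution of a 4-way case statement looks like this: every lane pays
for every case, and a selector keeps only the matching result.

```python
# Sketch of SIMD-style predicated execution.  All lanes step through
# every case; a per-lane selector masks in the one result each lane
# actually keeps.  The branch functions here are invented for
# illustration -- real GPUs do this with per-instruction predicates.

def branch0(x): return (x + 0x9E3779B9) & 0xFFFFFFFF
def branch1(x): return x ^ 0xDEADBEEF
def branch2(x): return ((x << 3) | (x >> 29)) & 0xFFFFFFFF
def branch3(x): return (x - 0x12345678) & 0xFFFFFFFF

CASES = [branch0, branch1, branch2, branch3]

def simd_case(lane_values, lane_selectors):
    """Every lane computes all four cases; the selector keeps one."""
    results = list(lane_values)
    for case_id, fn in enumerate(CASES):
        candidate = [fn(v) for v in lane_values]   # all lanes execute this case
        results = [c if s == case_id else r        # predicate: keep or discard
                   for c, s, r in zip(candidate, lane_selectors, results)]
    return results
```

The cost is visible in the loop: four lanes wanting four different
cases still execute all four case bodies, which is the 4X slowdown
described above.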

This is why I like the AntCrypt solution of having many hash functions
and selecting unpredictably among them.  If the GPU has to execute
them all, and you have 1,000 of them, then you have just slowed the
attack down by 1000X.
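A toy version of that idea (not AntCrypt's actual construction; the
function pool and mixing steps here are invented) is just a
data-dependent dispatch into a pool of small mixers:

```python
# Toy sketch of selecting unpredictably among many hash functions.
# A real design might have hundreds or thousands of distinct functions;
# 16 made-up ones are enough to show the dispatch.

MASK = 0xFFFFFFFF

def rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK

# Pool of distinct mixing functions, parameterized by k.
FUNCS = [lambda x, k=k: rotl(x ^ ((0x9E3779B9 * (k + 1)) & MASK), (k % 31) + 1)
         for k in range(16)]

def hash_step(state, data):
    idx = (state ^ data) % len(FUNCS)   # unpredictable, data-dependent choice
    return FUNCS[idx]((state + data) & MASK)
```

Since `idx` depends on the evolving state, a lockstep GPU warp whose
lanes disagree on `idx` has to walk through every selected function,
which is where the claimed slowdown comes from.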

I don't know enough about GPUs to know how hard it is for them to
effectively become MIMD units by interpreting data as instructions.
However, whatever slowdown that causes is about the most that
multiple cases can do to slow down a GPU, I'm guessing.  I keep
thinking about how the 13,000 functions I just generated would be
run, and it's a *lot* faster than I was hoping.

The only operations I'm using for hashing are permutations for
mixing: +, -, and ^ with either other values in memory or constants,
and rotations, plus non-linear operations that cannot be reversed,
AND and OR, but where the result is mixed into another value as a
permutation.  That's not a lot of distinct assembly instructions,
maybe only 15 or so.  No matter what I build from these, a GPU will
be able to run an interpreter for my small instruction set, and I
doubt it will run more than 100X slower.
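To make the point concrete, here is a hypothetical sketch of how small
that interpreter loop is (the opcode names and encoding are invented;
only the operation set matches what's described above).  Note how the
irreversible AND/OR results are added into the state rather than
replacing it, so no entropy is destroyed:

```python
# Hypothetical interpreter for a tiny mixing instruction set: +, -, ^
# with memory values or constants, rotations, and AND/OR whose result
# is mixed in additively rather than overwriting the state.

MASK = 0xFFFFFFFF

def rotl(x, r):
    r &= 31
    return ((x << r) | (x >> (32 - r))) & MASK

def interpret(program, state, mem):
    """Run a list of (opcode, operand) pairs; return the final state."""
    for op, arg in program:
        if op == 'add':    state = (state + arg) & MASK
        elif op == 'sub':  state = (state - arg) & MASK
        elif op == 'xor':  state ^= arg
        elif op == 'rot':  state = rotl(state, arg)
        elif op == 'addm': state = (state + mem[arg % len(mem)]) & MASK
        elif op == 'and':  # irreversible op, mixed in as a permutation
            state = (state + (state & mem[arg % len(mem)])) & MASK
        elif op == 'or':
            state = (state + (state | mem[arg % len(mem)])) & MASK
        else:
            raise ValueError("unknown opcode: %s" % op)
    return state
```

A dispatch loop this small is exactly why an interpreter on a GPU
shouldn't cost more than a modest constant factor, no matter how many
generated functions sit on top of it.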

I haven't been able to figure out a good solution, yet.
