Date: Wed, 17 Sep 2014 14:52:39 -0400
From: Bill Cox <waywardgeek@...hershed.org>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] omegacrypt and timing

On 09/17/2014 02:12 PM, Brandon Enright wrote:
> On Wed, 17 Sep 2014 12:14:57 -0500 (CDT) Steve Thomas
> <steve@...tu.com> wrote:
> 
>>> For the very first round of branching the GPU could run all 4
>>> in parallel and select the correct one. But, for the second
>>> round, the GPU would have to run branch 1 for each of the
>>> possible previous branches, and branch 2 for each of the
>>> possible branches, and so forth.
>>> 
>>> The carry value is modified from start to finish through each 
>>> branch. There is no realistic way to perform the attack you
>>> describe because the previous branch path possibilities grows
>>> exponentially. After just a few rounds you'd be doing an
>>> exponential amount of work.
> 
>> When you run out of the ChaCha block data you conditionally
>> generate another. There is no exponential cost increase for
>> GPUs.
> 
> Hi Steve, have you looked carefully at how the branching is done? 
> ChaCha is used for more than just choosing the branches.  It also 
> chooses which memory gets read and written.  Generating the ChaCha 
> stream is a non-issue.  You could pre-generate a few MB of it if
> you wanted.
> 
> The issue is that after a round is run, depending on which branch
> was taken a different amount of the ChaCha stream is consumed, which
> makes both the branch and the memory locations of the next
> round depend on how much ChaCha stream has been consumed.
> 
> So it's not just a matter of running the next 4 branches in
> parallel and picking one because each branch has multiple different
> memory locations depending on where that branch falls in the
> stream.  The carry value would also be unpredictable.
> 
> To avoid misalignment, if you ran all 4 for round 1, and then
> selected the right one, then all 4 for round 2, then selected the
> right one, etc., you'd be doing 4x as many memory operations and
> you'd need a way of discarding the memory changes made by the 3
> wrong branches.  Is this the attack you're suggesting?
> 
> Brandon
> 
> 

Hi, Brandon.  Yes, I think this is exactly the attack Steve and I are
thinking of.  If there are 4 cases, that can cost up to 4X more
computation per GPU core, because each core has to do every case.
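In rough C, the lockstep evaluation looks like this (the branch
bodies here are hypothetical stand-ins, not OmegaCrypt's real cases):

```c
#include <stdint.h>

/* Four hypothetical branch bodies, stand-ins for the real cases. */
static uint32_t branch0(uint32_t x) { return x + 0x9e3779b9u; }
static uint32_t branch1(uint32_t x) { return x ^ (x >> 7); }
static uint32_t branch2(uint32_t x) { return (x << 5) | (x >> 27); }
static uint32_t branch3(uint32_t x) { return x * 2654435761u; }

/* SIMT-style evaluation: compute all four cases, then keep only
 * the one picked by the data-dependent 2-bit selector.  Every
 * core pays for all four, hence the 4X. */
uint32_t run_round(uint32_t x, uint32_t selector)
{
    uint32_t r0 = branch0(x), r1 = branch1(x);
    uint32_t r2 = branch2(x), r3 = branch3(x);
    switch (selector & 3) {
    case 0:  return r0;
    case 1:  return r1;
    case 2:  return r2;
    default: return r3;
    }
}
```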

GPU instructions all carry a predicate that says something like
"only execute me if the zero flag is set".  That's how
if-then-else statements are handled, where the branch depends on the
zero flag, for example: every core executes all the instructions in
both branches, and each core keeps only the result its condition
selects.  I think it applies very naturally to case statements too.
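The same effect can be written on a CPU as a branchless select; this
little sketch (not any particular GPU ISA) shows the idea of computing
both sides and masking in the live one:

```c
#include <stdint.h>

/* Branchless if-then-else: both sides have already been computed,
 * and a mask derived from the condition picks the result -- the
 * same effect as per-instruction predication on a GPU. */
uint32_t predicated_select(uint32_t cond, uint32_t then_val, uint32_t else_val)
{
    uint32_t mask = (uint32_t)0 - (cond != 0);  /* all-ones if cond, else 0 */
    return (then_val & mask) | (else_val & ~mask);
}
```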

This is why I like the AntCrypt solution of having many hash functions
and selecting unpredictably between them.  If the GPU has to execute
them all, and you have 1000 of them, then you've slowed the
attack down by 1000X.
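A sketch of that approach (the sub-hashes here are invented for
illustration, not AntCrypt's actual functions): an unpredictable
index picks one of N mixing functions, so a lockstep GPU that can't
branch ends up paying for all N per selection.

```c
#include <stdint.h>

typedef uint32_t (*mix_fn)(uint32_t);

/* Hypothetical sub-hashes; a real design would have many more. */
static uint32_t mix_add(uint32_t x) { return x + 0x85ebca6bu; }
static uint32_t mix_xor(uint32_t x) { return x ^ (x >> 13); }
static uint32_t mix_rot(uint32_t x) { return (x << 7) | (x >> 25); }
static uint32_t mix_mul(uint32_t x) { return x * 0xc2b2ae35u; }

static const mix_fn table[] = { mix_add, mix_xor, mix_rot, mix_mul };

/* With N entries in the table, a SIMT attacker evaluating all
 * branches does N times the work of a CPU that just indexes in. */
uint32_t mix_step(uint32_t state, uint32_t selector)
{
    return table[selector % (sizeof table / sizeof table[0])](state);
}
```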

I don't know enough about GPUs to say just how hard it is for them to
effectively become MIMD units by interpreting data as instructions.
However, whatever slowdown that causes is about the most that
multiple cases can do against GPUs, I'm guessing.  I keep thinking
about how the 13,000 functions I just generated would be run, and
it's a *lot* faster than I was hoping.

The only instructions I'm using for hashing are reversible mixing
permutations: +, -, and ^, with either other values in memory or
constants, plus rotations, and non-linear operations that cannot be
reversed: AND and OR, but where the result is mixed into another
value as a permutation.  That's not many distinct assembly
instructions, maybe only 15 or so.  No matter what I build from
these, a GPU will be able to run an interpreter for my small
instruction set, and I doubt it will run more than 100X slower.
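The interpreter would be tiny.  Something like this sketch (the
opcode set is invented for illustration; it just mirrors the mix of
reversible ops plus AND/OR results folded into a second word):

```c
#include <stdint.h>
#include <stddef.h>

/* A handful of the ~15 mixing opcodes described above. */
enum { OP_ADD, OP_SUB, OP_XOR, OP_ROT, OP_AND_MIX, OP_OR_MIX };

typedef struct { uint8_t op; uint32_t operand; } insn;

/* Data-as-instructions: every GPU core can run this same loop in
 * lockstep over its own program, which is why thousands of
 * generated functions don't cost thousands of X. */
void interpret(uint32_t s[2], const insn *prog, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t k = prog[i].operand;
        switch (prog[i].op) {
        case OP_ADD: s[0] += k; break;
        case OP_SUB: s[0] -= k; break;
        case OP_XOR: s[0] ^= k; break;
        case OP_ROT: {                      /* rotate left by k mod 32 */
            uint32_t r = k & 31;
            s[0] = (s[0] << r) | (s[0] >> ((32 - r) & 31));
            break;
        }
        case OP_AND_MIX: s[1] += s[0] & k; break; /* irreversible result */
        case OP_OR_MIX:  s[1] ^= s[0] | k; break; /* mixed into word 2  */
        }
    }
}
```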

I haven't been able to figure out a good solution, yet.

Bill
