Date: Wed, 17 Sep 2014 19:41:09 +0000
From: Brandon Enright <bmenrigh@...ndonenright.net>
To: Steve Thomas <steve@...tu.com>
Cc: discussions@...sword-hashing.net, bmenrigh@...ndonenright.net
Subject: Re: [PHC] omegacrypt and timing

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 17 Sep 2014 13:50:01 -0500 (CDT)
Steve Thomas <steve@...tu.com> wrote:

> > To avoid misalignment, if you ran all 4 for round 1, and then
> > selected the right one, then all 4 for round 2, then selected the
> > right one, etc., you'd be doing 4x as many memory operations and
> > you'd need a way of discarding the memory changes made by the 3
> > wrong branches. Is this the attack you're suggesting?
> >  
> 
> No, I'm saying that a GPU will waste clock cycles while skipping the
> wrong data paths. This is due to its conditional execution of
> instructions: if a thread is not supposed to run an instruction, it
> executes a nop (no operation) instead.

Interesting.  So let me make sure I understand what this attack would
look like.

You'd run N instances of OmegaCrypt on the GPU by allocating N ChaCha
states and N large regions of memory.  You'd then allocate 4 threads
(or maybe 5, if you need a master thread) per OmegaCrypt instance, and
each round 3N of the 4N threads would disable (nop) themselves based
on the data-dependent branch selection. In this way you'd keep all 4N
threads in lockstep even though only N threads' worth of useful work
is being done.
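
Something like this minimal CUDA sketch is how I picture that layout
(the kernel, names, and state handling are purely illustrative, not
the actual OmegaCrypt code): each group of 4 threads serves one
instance, each thread owns one branch body, and 3 of the 4 are
predicated off every round.

__global__ void omegacrypt_instances(unsigned char *mem, unsigned int *chacha,
                                     int n_instances, int rounds)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int instance = tid / 4;   /* which OmegaCrypt instance this thread serves */
    int branch   = tid % 4;   /* which of the 4 branch bodies it owns         */

    if (instance >= n_instances)
        return;

    for (int r = 0; r < rounds; r++) {
        /* Data-dependent selector, derived from this instance's ChaCha state. */
        unsigned int sel = chacha[instance] & 3;

        /* Only the matching thread is active; the other 3 in the group are
           predicated off and effectively issue nops, staying in lockstep. */
        if (branch == sel) {
            /* ... branch-specific reads/writes into mem for this instance ... */
        }

        /* ... advance the ChaCha state for this instance ... */
    }
}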

If that's the case, then with my current design I'd have to increase
the number of branch paths substantially, which seems hacky and is
something I'd really rather not do.

So, a question about GPU memory: if you have a ton of threads each
accessing memory at random, how well does that scale?  It won't exhaust
raw memory bandwidth, but won't even a small number of threads exhaust
the rate at which the memory can serve accesses to "cold"
banks/blocks/regions?
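
To make the access pattern I'm asking about concrete, here's an
illustrative CUDA kernel (again, just a sketch, not OmegaCrypt
itself): each thread chases data-dependent indices through a large
buffer, so every load lands on an effectively random, uncoalesced
address.

__global__ void random_chase(const unsigned int *mem, size_t words,
                             unsigned int *out, int steps)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int idx = tid;

    for (int s = 0; s < steps; s++) {
        /* Each load's address depends on the previous load, so accesses
           are serially dependent and scattered across the whole buffer. */
        idx = mem[idx % words];
    }
    out[tid] = idx;   /* keep the result live so the loop isn't optimized away */
}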

Brandon

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iEYEARECAAYFAlQZ4+AACgkQqaGPzAsl94J0PwCgkK84p89W3q/W+MsbX1q5MJa4
pfUAoMXvcm5wjrGh7s2EoFxWbrewc4uz
=WxpC
-----END PGP SIGNATURE-----
