Open Source and information security mailing list archives
Message-ID: <5419D877.2080001@ciphershed.org>
Date: Wed, 17 Sep 2014 14:52:39 -0400
From: Bill Cox <waywardgeek@...hershed.org>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] omegacrypt and timing

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 09/17/2014 02:12 PM, Brandon Enright wrote:
> On Wed, 17 Sep 2014 12:14:57 -0500 (CDT) Steve Thomas
> <steve@...tu.com> wrote:
>
>>> For the very first round of branching the GPU could run all 4
>>> in parallel and select the correct one. But, for the second
>>> round, the GPU would have to run branch 1 for each of the
>>> possible previous branches, and branch 2 for each of the
>>> possible branches, and so forth.
>>>
>>> The carry value is modified from start to finish through each
>>> branch. There is no realistic way to perform the attack you
>>> describe because the number of previous branch paths grows
>>> exponentially. After just a few rounds you'd be doing an
>>> exponential amount of work.
>
>> When you run out of the ChaCha block data you conditionally
>> generate another. There is no exponential cost increase for
>> GPUs.
>
> Hi Steve, have you looked carefully at how the branching is done?
> ChaCha is used for more than just choosing the branches. It also
> chooses which memory gets read and written. Generating the ChaCha
> stream is a non-issue. You could pre-generate a few MB of it if
> you wanted.
>
> The issue is that after a round is run, depending on which branch
> was taken, a different amount of the ChaCha stream is consumed,
> which makes both the branch and the memory locations of the next
> round depend on how much ChaCha stream has been consumed.
>
> So it's not just a matter of running the next 4 branches in
> parallel and picking one, because each branch has multiple
> different memory locations depending on where that branch falls in
> the stream. The carry value would also be unpredictable.
> To avoid misalignment, if you ran all 4 for round 1, and then
> selected the right one, then all 4 for round 2, then selected the
> right one, etc., you'd be doing 4x as many memory operations and
> you'd need a way of discarding the memory changes made by the 3
> wrong branches. Is this the attack you're suggesting?
>
> Brandon

Hi, Brandon. Yes, I think this is exactly the attack Steve and I are
thinking of. If you have 4 cases, that could result in 4X more
computation per GPU core, because every core has to do every case. GPU
instructions all carry a condition that says things like "only execute
me if the zero flag is set". That is how if-then-else statements are
done, where the branch depends on the zero flag, for example. It
enables all the cores to execute all the instructions in both branches,
with each core keeping only the result it needs. It applies very
naturally, I think, to case statements.

This is why I like the AntCrypt solution of having many hash functions
and selecting unpredictably between them. If the GPU has to execute
them all, and you have 1000 of them, then you just slowed down the
attack by 1000X. I don't know enough about GPUs to know just how hard
it is for them to effectively become MIMD units by interpreting data as
instructions. However, whatever slow-down that results in is about the
most that multiple cases can do to slow down GPUs, I'm guessing.

I keep thinking of how the 13,000 functions I just generated would be
run, and it's a *lot* faster than I was hoping. The only instructions
I'm using for hashing are permutations for mixing (+, -, ^, with
either other values in memory or constants), rotations, and non-linear
operations that cannot be reversed (AND and OR), but where the result
is mixed in as a permutation into another value. It's not a lot of
different assembly code instructions, maybe only 15 or so.
No matter what I build with these, a GPU will be able to run an
interpreter for my small set of instructions, and I doubt it will run
more than 100X slower. I haven't been able to figure out a good
solution yet.

Bill
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAEBAgAGBQJUGdh0AAoJEAcQZQdOpZUZDTIP/2+mYBo9Nsxfrx4Dg79TloFx
PRdsyB9rDj+emHZnlZu6SgYd3QNxXIgKJIxI1BppE5+Uz3TLf8QFtykecsVd0/IK
j4gZM1AGy+s9MoC34EZT+NJDEN6rvVbgS9QY6EwpoAx747S/zLbsd5nopRaJbNFE
UCUi/1yX/3UO9G44MI5CLLDRtf5OfG2UN9h9RKOVp143ezkmCAU4t4k2dAwQ4Yl2
ZCd0ekooc6VXlUY68m7QS8N6RtgQKsPUhyiSx3oz0it+kL4bBDMYXeqDof0ooBNT
Nhyy+ugzknkxYu3+zn10d4Oq5WzQ/dMMnK8G82VhfxaVBRrd+16qxqUZyABpSTxA
UyFTwLqJeBVSkSf/mDsUNob1IVKl3m9mzIdqFfa6gtnbR0er7p4tqfpEWNHIrJex
N21XEz0ODd/ZVE/qAFsgXxJRtEVgGlX9gdEhRAKRlUGZCIsOZvM704rQbIGdiHQw
Mbu0ctMiwYJoq2bjWG0//JTaKpVnpEEJK94Ww64N8L6yMfFwrr4Zgq5lNS9U/XRM
pUmfPBh6XRgYjN7knK6038PAZEDQdJ1Mm8qDLKcs6CQ5xPTbaHplvWuHGu2u3sOk
zkMAV9YL80QIuyfo/lmf0u4tkbIOuLyYMEjZXsCV9tvqFwR9NJFFi31Ky7RMtCVC
cjqtiYFzyoCy2/LPPIwa
=JOlw
-----END PGP SIGNATURE-----