lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 07 Jul 2015 13:01:32 -0300
From: Marcos Simplicio <>
Subject: Re: [PHC] GPU benchmarks: Lyra2+yescrypt

Hi, Alexander.

On 04-Jul-15 14:34, Solar Designer wrote:
> Hi Marcos,
> On Thu, May 07, 2015 at 01:08:13PM -0300, Marcos Simplicio wrote:
>> It took some time, but we finally completed the GPU benchmarks mentioned
>> in the e-mail below, both for Lyra2 and yescrypt.
> [...]
> Have you made any further progress on this since May?

Yes, we did: we have been working with the ccminer's code you mentioned
earlier, especially to use their assembly portions. So far, we were able
to reduce yescrypt's cracking time to 2/3+ of our original results. This
is still ~30% slower than what the miner is able to provide with its
dedicated structure (e.g., no tcost parameter or support for
parallelism), but our code works for any set of parameters.

After we finish those optimizations, we are planning to see how much of
it applies to Lyra2's code.

> I'd like to let you know that Agnieszka Bielec working with us on
> integrating support for PHC finalists into John the Ripper, including on
> GPU, appears to have achieved a 3x+ speedup for Lyra2 over your CUDA
> code, as currently tested at t_cost=8, m_cost=8, N_COLS=256, nPARALLEL=2
> (your defaults).  I think this corresponds to 192 KiB per instance:
> The current choice of 8 and 8 is arbitrary.  We should test with other
> settings as well.  Especially with the lowest t_cost, and higher m_cost,
> and with nPARALLEL=1.

That is very interesting! Supposing the same optimizations we are
applying to yescrypt also apply to Lyra2's code, that means Agnieszka
code would still likely be ~2x faster than ours, so it is definitely
worth taking a look. I hope the same techniques apply to CUDA too,
because we have no OpenCL programmer on our side...

As for the optimizations being used, I'm not a GPU programmer, so I
asked our "GPU-guy" to take a look on the discussion. In principle (and
assuming he understood correctly what is being done), he liked the idea
of a "hybrid" use of the global and shared memory. On the other hand, it
is indeed a good point to try different values for tcost, as that may
influence the gains obtained with this strategy.

I personally would also suggest playing a little with the values of
m_cost and N_COLS (e.g., doubling the first and halving the second) to
see what happens. The idea is that, for such small memory usages, the
effects of CPU caching may actually not be needed, so a lower N_COLS
(and, thus, higher m_cost) may make more sense for legitimate users in
these conditions. At least our observations so far indicate that
attackers may benefit from higher N_COLS (as opposed to higher m_cost)
better than defenders, as the gains we saw diminish as N_COLS approach
the maximum cache utilization (e.g., an N_COLS of 128 was only ~5% worse
than an N_COLS of 256 in our machine).



Powered by blists - more mailing lists