phc-discussions - [PHC] yescrypt & Lyra2 on GPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180428212332.GA31541@openwall.com>
Date: Sat, 28 Apr 2018 23:23:32 +0200
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: [PHC] yescrypt & Lyra2 on GPU

Hi,

Due to use in altcoins, there are new results for yescrypt 0.5'ish (PHC
v0 from 2014) and Lyra2 (I don't know which revision) on NVIDIA GPUs.
Unfortunately, the discussion thread is in Japanese and the code is on
OneDrive, but the CUDA source is in there and it is quite clear:

http://askmona.org/9245

BTW, I'd appreciate it if someone gets me in touch with the author of
that miner mod via e-mail.  (I think there's no e-mail address given in
the zip archive, nor in the thread, nor in askmona user profile.)

The main approach is similar to what we see in argon2-gpu, where a
single hash computation is split across many "threads" (32 in case of
argon2-gpu), but taken to an even lower level (where what were
individual SIMD vector elements on CPU, or even their 32-bit halves if
64-bit, become separate SIMT threads, even if frequent synchronization
and data exchange or duplicate work is needed).  I didn't look into
what's achieved there for Lyra2, as I'm naturally more interested in
what happens to yescrypt.

For yescrypt, the split is across 16 SIMT threads, but due to yescrypt's
use of 64-bit math each pair of those threads ends up doing duplicate
work by two threads within the pair, so it's actually a split into 8
rather than into 16 parallel computations.  I guess such duplicate work
isn't any worse than leaving those threads unused would have been (if
we're bumping into NVIDIA's shared memory size) whereas the extra data
exchanges that would be needed otherwise would have cost extra.

There's simply no more parallelism in yescrypt than that: it's four
16-byte parallel lanes within each 64-byte sub-block, and then it's two
64-bit halves within those lanes, additionally with a frequent data
dependency (each pwxform round) between the halves (but this
implementation goes for the split and the resulting overhead anyway).

Here's what this looks like in code:

__device__ __forceinline__ uint2 mad64(const uint32_t a, const uint32_t b, uint2 c)
{
#if 0
        return vectorize((uint64_t)a * (uint64_t)b) + c;
#else
        asm("{\n\t"
                "mad.lo.cc.u32 %0, %2, %3, %0; \n\t"
                "madc.hi.u32 %1, %2, %3, %1; \n\t"
                "}\n\t" : "+r"(c.x), "+r"(c.y) : "r"(a), "r"(b)
        );
        return c;
#endif
}

#define Shared(a) *(uint2*)&shared_mem[(threadIdx.y * 512 + (a)) * 4 + (threadIdx.x & 2)]

                uint2 buf;
                uint32_t x[256];

                        for (k = 0; k < r * 2; k++) {
                                x[k] ^= x[(k + r * 2 - 1) & (r * 2 - 1)];
                                buf.x = __shfl_sync(0xFFFFFFFF, x[k], 0, 2);
                                buf.y = __shfl_sync(0xFFFFFFFF, x[k], 1, 2);
#pragma unroll
                                for (j = 0; j < 6; j++) {
                                        x0 = ((__shfl_sync(0xFFFFFFFF, buf.x, 0, 4) >> 4) & 255) + 0;
                                        x1 = ((__shfl_sync(0xFFFFFFFF, buf.y, 0, 4) >> 4) & 255) + 256;
                                        buf = mad64(buf.x, buf.y, Shared(x0));
                                        buf ^= Shared(x1);
                                }
                                if (threadIdx.x & 1) x[k] = buf.y;
                                else x[k] = buf.x;
                        }

The 64-bit "uint2 buf" is located within each 32-bit SIMT thread (in
hardware, it's two 32-bit elements in two different SIMD registers),
mad64() performs two 32-bit MADs within each thread sequentially (or
perhaps pipelined) to produce the 64-bit result, and the S-box lookups
are done using indices from the lower 64-bit half of the original (CPU
implementation's) 128-bit SIMD vectors obtained using __shfl_sync() to
read them from the same or the adjacent thread.  I'm not sure if
__shfl_sync() is fast enough for this to be worth it compared to split
into fewer threads.  Perhaps it is, since in hardware it's just access
to within the same SIMD registers - much like we extract indices from
the needed part of the SIMD registers on CPU.  And as you can see the
even/odd threads store only 32 bits of the 64-bit values they computed -
that's because of the duplicate work between them.

The resulting performance is quite good.  A certain cryptocoin is using
yescrypt 0.5'ish (PHC v0 from 2014) at N=4096 r=16 p=1 t=0 (8 MiB), and a
person behind it was kind enough to share benchmarks of the above GPU
miner with me.  For reference, on i7-4770K for that revision of yescrypt
and with those parameters I am getting ~740 h/s, with new implementation
optimizations backported from yescrypt 1.0 (but retaining compatibility
with 0.5) this improves to ~770 h/s.  And the results sent to me for
NVIDIA Pascal are:

GTX 1060 - 1060 h/s

GTX 1070 - 1800 h/s

GTX 1080 - 3200-3300 h/s

GTX 1080 Ti - 3600-4200 h/s

Also, I'm told the GPU global memory usage is approx. at 2/3 of max, so
the miner is not bumping into that yet.  Per the code and the comments
in the Japanese thread (after Google Translate), the miner is bumping
into so-called shared memory within each SM (as I had intended), but it
nevertheless is able to run 2 or 3 warps at once.

Thus, GTX 1080 Ti gives a speedup of ~5x, but it also costs ~5x more
than a modern CPU of performance comparable to or faster than the
quad-core i7-4770K (roughly $1k for GTX 1080 Ti vs. $200 for 6-core
i5-8400 or $300 for a budget 8-core AMD Ryzen).  OTOH, GPUs win in terms
of amortization of cost of the rest of the rig (since many GPUs can be
put in the same rig cheaply).  Yet another aspect is that for cryptocoin
mining the GPUs would be better used on another coin, where they provide
much more of a speedup anyway (e.g., on Zcash's Equihash).

Then there are NVIDIA Volta GPUs, which are currently available on AWS
and which are roughly twice faster than the largest Pascal ones.

But we're interested in the performance figures here for how they'd
relate to uses of yescrypt for password hashing, where there wouldn't
necessarily be a "more profitable" hash to crack.  Well, yescrypt 1.0 by
default uses 12 KiB for S-boxes vs. 0.5's 8 KiB.  So maybe the number of
concurrent warps would have to be lowered accordingly, which will slow
things down by up to as much (that is, up to by a third).  There are
also the added writes into S-boxes, which are essentially free on CPU
(the L1 cache write ports and the execution port for store instructions
would have been idle otherwise), but will probably cost a little bit of
time within each thread on GPU.

(For the cryptocoin folks reading this: no, I don't recommend that you
upgrade to yescrypt 1.0.  You can do better than that, if you want to
defeat GPUs at all rather than keep their usage moderate as it happens
on its own currently.  There might also be a special revision of new
yescrypt for the PoW use case later.  Please feel free to e-mail me.)

Overall, it looks like with current yescrypt a dual-CPU server doing
password hashing at low to moderate memory cost (a few megabytes per
hash) will be roughly on par with one of the largest GPUs, when not
using a ROM.

It's also curious that yescrypt at low memory settings like 2 MiB is
currently slightly more GPU resistant (relative to its performance on
CPU) than it is at 8 MiB: either fits in global memory anyway, but 2 MiB
has higher L3 cache hit rate on CPU and thus its pwxform runs relatively
faster while hurting GPUs as much.  Somewhere from 16 to 32 MiB might
finally bump into global memory size again, and thus work against GPUs.

With these results, I am once again thinking of possibly moving from 12
to 24 KiB for the S-boxes in yescrypt by default.  yescrypt 1.0 already
has this as a configurable parameter (it's part of flags, and it gets
encoded into the hash strings and decoded from there - for values up to
768 KiB), but right now it's compile-time (the implementation will
refuse to process hashes that use a different than its compiled-in S-box
size).  Maybe in 1.1+ it'll be runtime.

On current Intel CPUs, the slowdown when going from 12 to 24 KiB is from
non-existent (on some servers under full load, which is actually the
most relevant scenario) to a few percent.  On AMD Bulldozer, it's worse
than that, but maybe we don't care about that microarch anymore?

There's also rumor that future Intel CPUs will have 48 KiB of L1d cache:

https://browser.geekbench.com/v4/cpu/2400363

(up from the current 32 KiB), so two SMT threads with 24 KiB each would
fit again (or maybe we'll move to 48 KiB S-boxes then, or to wider SIMD
and L2 cache - all of which is already tunable in yescrypt, just not yet
one of the standard configurations).

For yescrypt with a ROM, there are no GPU implementations (that I'm
aware of) yet.  I suppose they could eventually use CUDA uniform memory
to access host's or/and other GPUs' memory (as the ROM is meant to be
much larger than a single GPU card's memory).  This will increase
latency, and to compensate for that even more concurrent instances would
need to be run on each GPU, again bumping into the GPU's memory sizes
(on-die and/or on-card, whichever is hit first), so the speed will be
impacted.

So I think for yescrypt with a ROM this isn't too severe a threat yet,
but bumping up the S-boxes size could make sense anyway.

Finally, if very frequent __shfl_sync()'s work OK here, I am wondering
if they'd also be beneficial for bcrypt.  Currently the speedups on
large GPUs that we see for yescrypt above are on par with those we see
for bcrypt (and this confirms that yescrypt's bcrypt-like anti-GPU works
as intended).  For example, i7-4770K does 4.3k h/s for bcrypt cost 2^5
defensively (one hash per thread), 6.6k h/s offensively (2 or 3 hashes
per thread), and a GTX 1080 Ti at stock clocks does 20.7k h/s or with
o/c does 22.8k h/s per Jeremi's hashcat benchmarks:

https://gist.github.com/epixoip/973da7352f4cc005746c627527e4d073

That's roughly a factor of 5 increase from CPU defense to GPU attack
performance, similar to what we see for yescrypt 0.5'ish.

But maybe bcrypt speeds on NVIDIA Maxwell/Pascal/Volta can be improved
further with this trick.  Maybe the four Blowfish S-box lookups can
reasonably proceed concurrently in four adjacent SIMT threads, and then
their results can be combined in one of the four threads that would do
the add/xor/add.  The host code would need to expect final results from
every fourth thread only.

Since both JtR and Hashcat have gone OpenCL only (dropping CUDA), this
will need to be done in OpenCL, but I see no problem there - at worst,
inline asm will have to be used for the shfl.idx.b32 PTX instruction.
We're already using tiny bits of PTX asm in OpenCL kernels in JtR.

Anyone wants to give this a try?

The above code for yescrypt does not try to split the two S-box lookups
across threads, though.  (It only splits the four portions of each of
the two 128-bit lookups.)  So maybe that would have been too extreme, or
maybe it's worth trying there as well, or maybe it's unneeded there
because there's other use those threads can be put to due to the 128-bit
lookups (which isn't the case in bcrypt, where they're 32-bit).

Alexander