phc-discussions - Re: [PHC] yescrypt on GPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALWCaoYLaiMx+35ZyPbTbDGhkGvMRLq3M1M0V=_K8RhVJJYybg@mail.gmail.com>
Date: Tue, 6 Oct 2015 21:12:18 +0200
From: Massimo Del Zotto <massimodz8@...il.com>
To: Solar Designer <solar@...nwall.com>
Cc: discussions@...sword-hashing.net
Subject: Re: [PHC] yescrypt on GPU

> Yes, but why waste all 64 work-items on one yescrypt hash computation?


I made a bet. I knew, since before I started working on this (just looking
at the slides and such) the point was to produce a lot of memory
transactions. And there was this whole bcrypt thing which supposedly
produced a lot of 'small memory transactions' I had to optimize for
bandwidth usage an occupancy.
This was confirmed as I started looking at the code as this thing does not
look like having nearly enough arithmetic intensity to be accelerated by
compute-oriented architectures.
I assumed I would have got an overabundance of ALU and scarceness of
bandwidth and I had to get the highest occupancy I could to hide latency.
Apparently that wasn't exactly the case (but we know that now). Anyway: the
costly operations is memory transactions it seems, so I focused on them.

There were various attempts including a 'sliced' version which would keep
several hashes in flight working on a line of the block each time but it
didn't work as expected.
There's also the problem being the AMD OpenCL-C compiler. It is kinda a
black box and very often too conservative.
At the end, some of the things I do are evolutionary. The parallel
processing of the various chunks was added at the last revision and it
kinda did the trick.

> And you could also spend
> fewer WIs per yescrypt instance, for up to the obvious 64 yescrypt's in
> 64 WIs, in which case you wouldn't need any communication via local
> memory, but that's not necessarily more optimal

Indeed using the 1WI = 1 thread method has many advantages, including more
aggressive optimization by the compiler... but I don't see that fitting the
VGPR limit. I know djm's implementation does that but I haven't looked at
it. Given the current CL-C compiler I don't think it is possible to trick
it in outputting something decent.
He also has published a multi-step kernel. I looked at it 5 minutes and I
cannot explain myself why it is so much slower than the monolithic kernel.
It seems nVidia Maxwells like the multi-step kernel more. I would expect
the nV CL compiler to be much smarter.

> Do you see unaligned loads somewhere in yescrypt?  There shouldn't be any.

My error. I forgot the applied bitmask zeroes out those bits. It is more
readily apparent in the 0.7.1 formulation.

> When you tried uints, did you adjust your use of the work-items
> accordingly, so that you'd have more instances of yescrypt in the 64?

If memory serves it was 16x4... but I'm pretty sure the chunk loop wasn't
being 'transversely unrolled' at the time. I think I was mangling 8-ulongs
at time back then.
Plus I have to admit I still have plenty of LDS so perhaps I could try
going back at some point. I can try some different combinations for science.

> In fact, it doesn't make any sense to waste an entire CU on one hash,
> yet keep the S-boxes in global memory.  You have 64 KB of local memory
> per CU, enough for several yescrypt's.  Have you tried keeping the
> S-boxes in local memory?

There are a few nuances here.
The first is that OpenCL exposes 32KiB for some reason. In my experience,
going over 8KiB is painful. I don't see sufficient arithmetic intensity to
cope with the extremely limited occupancy that would result (note: AMD
CodeXL analysis gives double occupancy on Hawaii for some reason on LDS
limited scenarios, experiences welcome). Plus, block loads are strangely
slow in my experience.

There has been a point in the past where all 64WIs were doing exactly the
same, without even the "transversally unrolled" block_pwxform.
This does not make sense in a purely SIMD architecture, but GCN isn't
purely SIMD.
My estimation was: scarce memory resources. So the idea was to use WIs as
dumb loaders and ideally degenerate the whole CU to run purely on SALU.
It didn't quite work back then and it has no chance to work now; I tried.
Just to be clear, I'm pretty sure GCN can do better somehow but my minimum
standards are met with this set of kernels - no overspilling, ISA size
exceeding in no performance path, not slower than published, keeps system
smooth.
I don't see that approaching modern Intel processors any time soon.

To summarize, I wanted to get more from the memory subsystem and I
considered trading ALU power for it.

> What makes you think it's "1 [bytes]"?  That would be weird.

Indeed the AMD APP guide I consulted reads a bit different than what I
remember and it seems to suggest it's 1 uint. Which would make sense as 1-
it's a memory controller width and 2- it's historical RGBA32 so there might
be some fast path.

Nonetheless in my tests I have measured increasingly more efficient data
transfers (in terms of less traffic amplification, not necessarily faster)
the smaller the stride. Admittedly I never tested 1 byte in particular as
uint is just too convenient... but I started to have this impression by
looking at profiler output and I somehow convinced myself this time plus:
it was just handy. If somehow tested this more in depth, I would be happy
to hear those experiences.


Cordially,
Massimo

Content of type "text/html" skipped