Date: Mon, 12 Oct 2015 17:08:19 +0200
From: Massimo Del Zotto <>
To: Solar Designer <>
Subject: Re: [PHC] yescrypt on GPU

I read the links a few times. I feel like I'm getting a bit lost connecting
them to the instruction latency issue being considered. Maybe that's not
even the point anymore, I'm not focused.
What I'm sure of is that, between errata, language barriers and maybe some
slips, the available documentation is fairly inconsistent.

*When the linearized work size is < native wavefront size the hardware is
The ALU still dispatches at full rate but some elements of a SIMD will
be masked out. In the profiler this produces high VALUBusy% but low
VALUUtilization% (how many elements of a SIMD are valid). This matches my
experience and the picture with the grayed mask. Size 32 ->
VALUUtilization% <= 50%, Size 8 -> VALUUtilization% <= 12.5%, but VALUBusy%
can still be 100%.
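
The bounds quoted above follow from simple arithmetic. A minimal sketch, assuming a native wavefront size of 64 and that a partial wavefront still occupies a full wavefront slot; the function name is made up for illustration:

```python
WAVEFRONT_SIZE = 64  # assumed native wavefront size


def max_valu_utilization(work_size: int, wavefront_size: int = WAVEFRONT_SIZE) -> float:
    """Upper bound (in percent) on the share of SIMD elements that are valid."""
    if work_size <= 0:
        raise ValueError("work_size must be positive")
    # Partial wavefronts are rounded up to whole ones; the padding is masked out.
    wavefronts = -(-work_size // wavefront_size)
    return 100.0 * work_size / (wavefronts * wavefront_size)


if __name__ == "__main__":
    for size in (8, 32, 64):
        print(f"size {size:3d} -> VALUUtilization% <= {max_valu_utilization(size)}")
```

This only models the masking; it says nothing about VALUBusy%, which can stay at 100% regardless.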

*It is necessary to provide at least 1 wavefront per SIMD.*
In my experience even in this case the performance is very low, but I cannot
tell whether this is related to the ALU operation itself or to something else
running inefficiently. The benefit of going from occupancy 1/10 to 2/10 is
usually immense, but that could well be due to memory latency hiding alone.

*It seems some instructions even have a variable clock count.*
Taking from the CodeXL disassembly of SecondSmix1, S_CBRANCH_EXECNZ (there are
6) is reported as taking 4/16 clocks (depending on whether it is coherent or
not). V_CMP_GT_U64 (1!) takes 32 clocks.
The rest are only SALU ops such as S_AND_SAVEEXEC_B64 (8), S_ANDN2_B64
(13?), S_MOV_B64 (33), S_AND_B64 (6). They all take 4 clocks. This is from
~2400 ops. *Where did the 64-bit ops go?* I'm surprised as well.
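
To put those counts in proportion, here is a rough tally. It assumes the instructions listed above are the complete set found in the disassembly, takes the uncertain "13?" at face value, and treats "~2400" as exactly 2400; the variable and function names are illustrative only:

```python
# Instruction counts as quoted above; the S_ANDN2_B64 count was marked "13?".
LISTED_COUNTS = {
    "S_CBRANCH_EXECNZ": 6,
    "V_CMP_GT_U64": 1,
    "S_AND_SAVEEXEC_B64": 8,
    "S_ANDN2_B64": 13,
    "S_MOV_B64": 33,
    "S_AND_B64": 6,
}
TOTAL_OPS = 2400  # "~2400 ops" is approximate


def listed_share(counts: dict, total: int) -> tuple:
    """Return (number of listed ops, their percentage of all ops)."""
    n = sum(counts.values())
    return n, 100.0 * n / total


if __name__ == "__main__":
    n, pct = listed_share(LISTED_COUNTS, TOTAL_OPS)
    print(f"{n} listed ops, about {pct:.1f}% of ~{TOTAL_OPS}")
```

Under those assumptions the listed instructions are a small fraction of the kernel, which is what prompts the question about where the 64-bit ops went.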

Your observation about LDS layout is accurate and matches my understanding.

I made an error in quoting the message from himanshu.
Emphasis was put on three quarters of a wavefront being used to hide
instruction latency, but I agree that it doesn't make much sense as it seems
to contradict the grayed image (as well as my understanding). In my mind I
probably "auto-corrected" CU to "SIMD unit". This would *make sense*. The
chip-level dispatcher assigns wavefronts to CUs and the CUs distribute them to
the SIMDs as they become available. The SIMD is 16 uint wide so it
needs 4 clocks to process 64 WIs in packed registers (similarly to some
CPUs). Then if the ALU is a 4-stage pipeline we have basically 0 latency.
While the CU-level dispatcher waits for those 4 clocks, it dispatches to
the other SIMDs.
That's probably the "4 stages pipeline" realhet writes as:

stage0,         stage1,         stage2,         stage3
instr0[ 0..15], idle,           idle,           idle
instr0[16..31], instr0[ 0..15], idle,           idle
instr0[32..47], instr0[16..31], instr0[ 0..15], idle
instr0[48..63], instr0[32..47], instr0[16..31], instr0[ 0..15]
And here we are at 4-cycle latency: workitems[0..15] of instr0 are
completed, and the ALU can continue with the first quarter of the next
instruction:
instr1[0..15], instr0[48..63], instr0[32..47], instr0[16..31]
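
The table above can be reproduced with a few lines. This is a sketch of the model as I understand it (16 lanes, 64 WIs per wavefront, 4 pipeline stages, one quarter-wavefront entering per clock), not a description of the actual hardware; all names are illustrative:

```python
LANES = 16                     # SIMD width in work-items
WAVEFRONT = 64                 # work-items per wavefront
STAGES = 4                     # assumed pipeline depth
QUARTERS = WAVEFRONT // LANES  # quarter-wavefronts per instruction


def stage_contents(cycle: int, stage: int) -> str:
    """What occupies a pipeline stage at a given clock cycle."""
    q = cycle - stage  # index of the quarter-wavefront that entered at clock q
    if q < 0:
        return "idle"
    instr, quarter = divmod(q, QUARTERS)
    lo = quarter * LANES
    return f"instr{instr}[{lo}..{lo + LANES - 1}]"


def pipeline_rows(cycles: int) -> list:
    """One row per clock, one column per pipeline stage."""
    return [[stage_contents(c, s) for s in range(STAGES)] for c in range(cycles)]


if __name__ == "__main__":
    print(", ".join(f"stage{s}" for s in range(STAGES)))
    for row in pipeline_rows(5):
        print(", ".join(row))
```

In this model the first quarter of instr0 leaves the last stage on the same clock the first quarter of instr1 enters stage 0, which is why the latency appears to be hidden.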

Yes, we need 64 WIs to fill a SIMD and 256 WIs to fill a CU (4 clocks!).
SIMD units from the same CU can communicate only using memory (it's not
like they pipeline themselves somehow).
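
The numbers above, worked through as a small sketch. It assumes 4 SIMDs per CU (which is what 256 = 4 x 64 implies) and that the CU-level dispatcher visits the SIMDs round-robin, one per clock; the round-robin order is my reading of the model, and the names are made up:

```python
SIMD_LANES = 16      # work-items processed per clock by one SIMD
WAVEFRONT_SIZE = 64  # work-items per wavefront
SIMDS_PER_CU = 4     # assumption, consistent with 256 WIs filling a CU


def clocks_per_wavefront_instruction() -> int:
    """Clocks one SIMD needs to push a whole wavefront through one instruction."""
    return WAVEFRONT_SIZE // SIMD_LANES


def work_items_to_fill_cu() -> int:
    """Work-items needed so that every SIMD of a CU holds one wavefront."""
    return WAVEFRONT_SIZE * SIMDS_PER_CU


def simd_served_at(cycle: int) -> int:
    """SIMD that the CU-level dispatcher issues to at a given clock (round-robin)."""
    return cycle % SIMDS_PER_CU


if __name__ == "__main__":
    print("clocks per wavefront instruction:", clocks_per_wavefront_instruction())
    print("WIs to fill a CU:", work_items_to_fill_cu())
    print("issue order:", [simd_served_at(c) for c in range(8)])
```

With these assumptions each SIMD is revisited exactly when it has finished its 4 clocks, so the dispatcher never waits.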

Is there a problem with that? It seems to me at this point the model is
coherent but I suspect I might be misunderstanding the point of your last

