lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sun, 21 Sep 2014 14:11:21 -0500 (CDT)
From: Steve Thomas <>
Subject: Re: [PHC] Multiply with CUDA

> On September 20, 2014 at 6:14 AM Solar Designer <> wrote:
> On Fri, Sep 19, 2014 at 07:00:24PM -0500, Steve Thomas wrote:
> > but still a little faster than CPUs.
> I guess you mean in terms of throughput? Per multiprocessor vs. per core?

Total speed, if you tell a CPU and a GPU to do a bunch of multiplies, then I'm
pretty sure GPUs are slightly faster. I think the source I got this from was
doing arbitrarily large integer multiplies. If you only need to do 32 bit
floating point multiply then GPUs will be "10x" faster.

> > When comparing
> > CPUs and GPUs with hash function speeds GPUs are ~10x faster than optimized
> > SIMD
> > CPU code. So we're losing SIMD with multiply so that's a 8x hit. GPUs have a
> > similar hit on speed while doing smaller multiplies which is another ~4x
> > slowdown.
> You lost me here. In what case are we "losing SIMD with multiply"? Do
> you mean e.g. when we use specifically the 64x64->128 multiply on CPU?

There is no 64x64->128 vector multiply on x86.

> One of the reasons why I don't use 64x64->128 in yescrypt is that
> 64x64->128 is not directly available on 32-bit CPUs and in 32-bit mode
> on 64-bit CPUs. With 32-bit CPUs/mode in mind, it's 32x32->64 max, and
> we do have SIMD with that.

Most of what I was talking about is for arbitrarily large multiplies. I was
really thinking of Makwa. I probably should of mention this.

> > Last note, interleaving MULX (umul128), ADCX (_addcarryx_u64), and ADOX
> > (_addcarryx_u64) with VPMULUDQ (_mm256_mul_epu32) might get better
> > performance
> > on CPUs. MULX and VPMULUDQ should be similar in speed since VPMULUDQ can do
> > 4x(32bit*32bit=64bit) but there's 4x more work to do than doing
> > 64bit*64bit=128.
> > Interleaving them should mask some of the latency.
> I view potential SIMD/scalar interleaving as implementation detail, as
> long as the hashing scheme provides sufficient/tunable parallelism.
> Why do you say that "4x(32bit*32bit=64bit)" is "4x more work" than
> "64bit*64bit=128"?

Using "32bit*32bit=64bit" to do "64bit*64bit=128bit" takes four multiplies. Well
if you have a "33bit*33bit=66bit" or you only need "62bit*62bit=124bit", then
just three multiplies (and some extra additions and subtracts).

Powered by blists - more mailing lists