Message-ID: <862080371.709003.1411326681350.open-xchange@oxuslxltgw10.lxa.perfora.net>
Date: Sun, 21 Sep 2014 14:11:21 -0500 (CDT)
From: Steve Thomas <steve@...tu.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Multiply with CUDA
> On September 20, 2014 at 6:14 AM Solar Designer <solar@...nwall.com> wrote:
>
> On Fri, Sep 19, 2014 at 07:00:24PM -0500, Steve Thomas wrote:
> > but still a little faster than CPUs.
>
> I guess you mean in terms of throughput? Per multiprocessor vs. per core?
>
Total speed: if you tell a CPU and a GPU each to do a bunch of multiplies, I'm
pretty sure GPUs are slightly faster. I think the source I got this from was
doing arbitrarily large integer multiplies. If you only need to do 32-bit
floating-point multiplies, then GPUs will be "10x" faster.
> > When comparing
> > CPUs and GPUs with hash function speeds GPUs are ~10x faster than optimized
> > SIMD
> > CPU code. So we're losing SIMD with multiply so that's a 8x hit. GPUs have a
> > similar hit on speed while doing smaller multiplies which is another ~4x
> > slowdown.
>
> You lost me here. In what case are we "losing SIMD with multiply"? Do
> you mean e.g. when we use specifically the 64x64->128 multiply on CPU?
>
There is no 64x64->128 vector multiply on x86.
> One of the reasons why I don't use 64x64->128 in yescrypt is that
> 64x64->128 is not directly available on 32-bit CPUs and in 32-bit mode
> on 64-bit CPUs. With 32-bit CPUs/mode in mind, it's 32x32->64 max, and
> we do have SIMD with that.
>
Most of what I was talking about is for arbitrarily large multiplies. I was
really thinking of Makwa. I probably should have mentioned this.
> > Last note, interleaving MULX (umul128), ADCX (_addcarryx_u64), and ADOX
> > (_addcarryx_u64) with VPMULUDQ (_mm256_mul_epu32) might get better
> > performance
> > on CPUs. MULX and VPMULUDQ should be similar in speed since VPMULUDQ can do
> > 4x(32bit*32bit=64bit) but there's 4x more work to do than doing
> > 64bit*64bit=128.
> > Interleaving them should mask some of the latency.
>
> I view potential SIMD/scalar interleaving as implementation detail, as
> long as the hashing scheme provides sufficient/tunable parallelism.
>
> Why do you say that "4x(32bit*32bit=64bit)" is "4x more work" than
> "64bit*64bit=128"?
Using "32bit*32bit=64bit" multiplies to do "64bit*64bit=128bit" takes four
multiplies. Though if you have a "33bit*33bit=66bit" multiply, or you only need
"62bit*62bit=124bit", then it takes just three multiplies (plus some extra
additions and subtractions).
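For concreteness, here's a minimal sketch in C of the four-multiply schoolbook
decomposition I mean (the helper name is mine, not from any library). Split
each 64-bit operand into 32-bit halves, do the four cross products, and
propagate the carries:

```c
#include <stdint.h>

/* 64x64->128 built from four 32x32->64 multiplies (schoolbook).
   a = aH*2^32 + aL, b = bH*2^32 + bL, so
   a*b = hh*2^64 + (lh + hl)*2^32 + ll. */
static void mul64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t aL = (uint32_t)a, aH = a >> 32;
    uint64_t bL = (uint32_t)b, bH = b >> 32;

    uint64_t ll = aL * bL;   /* contributes to bits   0..63  */
    uint64_t lh = aL * bH;   /* contributes to bits  32..95  */
    uint64_t hl = aH * bL;   /* contributes to bits  32..95  */
    uint64_t hh = aH * bH;   /* contributes to bits  64..127 */

    /* Sum the middle terms; mid is at most ~2^34, so it can't overflow. */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;
    *lo = (mid << 32) | (uint32_t)ll;
    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```

The three-multiply variant is Karatsuba: compute ll, hh, and
(aL+aH)*(bL+bH), and recover the middle term by subtraction. That's where the
"33bit*33bit=66bit" requirement comes from, since the sums aL+aH and bL+bH can
be 33 bits wide.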