phc-discussions - Re: [PHC] yescrypt AVX2

Message-ID: <20150425184601.GA21408@openwall.com>
Date: Sat, 25 Apr 2015 21:46:01 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] yescrypt AVX2

On Sat, Apr 25, 2015 at 12:23:29PM -0300, Marcos Antonio Simplicio Junior wrote:
> Just to share some experience we had on the AVX2 matter: last year, an undergrad (student of a colleague of mine, from a different university) implemented Lyra2 taking advantage of AVX2 and got a 30% speed up with his implementation. 

Thanks!

I'm not surprised.  When we added AVX2 to JtR recently, we got speedups
of around 80% for many hash types (attack).  For POMELO, it's almost a
2x speedup between AVX and AVX2 (defense).  That's at low m_cost, indeed.
Your 30% sounds about right, since I guess you were measuring at higher
m_cost, where you run into the memory bandwidth limit.

yescrypt's pwxform is special.  It is built around the S-box lookup
width it has been tuned for.  That's intentional; it's part of the
defense.  When wider SIMD and/or a wider memory bus to the S-boxes is
available, the extra width is just wasted(*) - much like bcrypt wastes
GPU global memory bandwidth (because the bus and cache lines are much
wider than needed) if its S-boxes are placed in global memory.

(*) I am referring only to the S-box lookups here.  Wider SIMD may still
be used for the MUL-ADD-XOR portion of pwxform.  I actually tried to
exploit precisely this combination of different-width SIMD in pwxform.
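
As a rough illustration of the footnote above, here is a simplified
sketch of one pwxform-style step in C.  It is not the yescrypt reference
code: the names (entry_t, pwx_step, SIMPLE, SWIDTH) and the indexing by
entry number rather than by byte offset are assumptions made for
readability.  What it tries to show is that each step fetches one
fixed-width entry (here 2 x 64 bits = 128 bits) from each of two
S-boxes, and that only the MUL-ADD-XOR arithmetic is lane-parallel
work that wider SIMD could batch across several such groups.

```c
#include <stdint.h>

#define SIMPLE   2                 /* 64-bit lanes per S-box entry (assumed): 128-bit lookups */
#define SWIDTH   8                 /* log2 of entries per S-box (assumed) */
#define SENTRIES (1u << SWIDTH)

/* One S-box entry: the unit fetched by a single lookup. */
typedef struct {
    uint64_t lane[SIMPLE];
} entry_t;

/*
 * One simplified pwxform-style step on a group of SIMPLE 64-bit lanes.
 * The two lookup indices come from the low and high 32-bit halves of
 * lane 0, so the lookup width stays at SIMPLE * 64 bits no matter how
 * wide the CPU's vector registers are.
 */
void pwx_step(uint64_t x[SIMPLE],
              const entry_t s0[SENTRIES], const entry_t s1[SENTRIES])
{
    uint32_t lo = (uint32_t)x[0];
    uint32_t hi = (uint32_t)(x[0] >> 32);
    const entry_t *p0 = &s0[lo & (SENTRIES - 1)];   /* narrow lookup #1 */
    const entry_t *p1 = &s1[hi & (SENTRIES - 1)];   /* narrow lookup #2 */

    /* MUL-ADD-XOR: independent per lane, so this part is SIMD-friendly. */
    for (int k = 0; k < SIMPLE; k++) {
        uint64_t xl = (uint32_t)x[k];   /* low 32 bits */
        uint64_t xh = x[k] >> 32;       /* high 32 bits */
        x[k] = (xh * xl + p0->lane[k]) ^ p1->lane[k];
    }
}
```

With several independent groups in flight, a 256-bit unit could run the
loop body for two groups at once, but the two lookups per group would
still each return only 128 bits.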

However, I expected AVX2 could still be used to speed it up even when
tuned for AVX and below (or e.g. NEON), which would be both good and
bad.  It turned out I was wrong, which is also both good and bad.
So I felt I needed to post the status report in here, and also seek
confirmation of my understanding or advice on how I could have sped
things up without going for 256-bit or wider S-box lookups.  So far,
Samuel has provided confirmation (thanks!).

Now I can introduce a combination of PWX* settings that is a better fit
for AVX2, providing speedups similar to what we're seeing for other
hashes.  It would be meant for those who don't need bcrypt-equivalent
anti-GPU properties when these AVX2+ settings happen to be used on
pre-AVX2 CPUs as well.  (There would be no slowdown on those older CPUs;
only the bcrypt-style GPU resistance would be reduced.)
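
To make the trade-off concrete, here is a small sketch in C of how the
S-box lookup width and S-box size follow from the PWX parameters,
assuming the usual reading in which each S-box entry holds PWXsimple
64-bit lanes and each S-box has 2^Swidth entries.  The function names
are hypothetical, and the value 4 for PWXsimple is only an illustrative
guess at what an AVX2-oriented setting could look like; the message
above does not state the actual settings.

```c
#include <stddef.h>

/* Bits returned by one S-box lookup: PWXsimple lanes of 64 bits each. */
unsigned sbox_lookup_bits(unsigned pwx_simple)
{
    return pwx_simple * 64;
}

/* Bytes in one S-box: 2^Swidth entries of PWXsimple * 8 bytes each. */
size_t sbox_bytes(unsigned pwx_simple, unsigned swidth)
{
    return ((size_t)1 << swidth) * pwx_simple * 8;
}
```

Under these assumptions, sbox_lookup_bits(2) gives 128-bit lookups
(a fit for SSE2/AVX/NEON registers), while sbox_lookup_bits(4) would
give 256-bit lookups matching AVX2 registers.  At Swidth = 8 each S-box
would grow from 4096 to 8192 bytes, and each lookup would make fuller
use of a wide bus, which is why the bcrypt-style GPU resistance would
be expected to drop.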

Alexander