Date: Sat, 25 Apr 2015 12:23:29 -0300 (BRT)
From: Marcos Antonio Simplicio Junior <>
Subject: Re: [PHC] yescrypt AVX2


----- Original Message -----

> From: "Solar Designer" <>
> To:
> Sent: Saturday, April 25, 2015 5:09:59
> Subject: [PHC] yescrypt AVX2

> Hi,

> So I thought that AVX2 could be of significant help even when pwxform
> is tuned for 128-bit gather lanes, like we want it to be for better
> bcrypt-like anti-GPU on the (currently widespread) machines that can
> only do 128-bit SIMD and not wider yet.

> It appears that I was mostly wrong, speaking of AVX2 as implemented
> in Haswell. My expectation for AVX2 working reasonably well even when
> "misused" like that was based on my previous experiments (unreleased)
> with 64-bit bcrypt-like S-box lookups in 128-bit SIMD code, where we
> have _mm_loadl_epi64() and _mm_loadh_pi() to efficiently load 64-bit
> halves of a 128-bit register. On SSE4.1, there are also instructions
> to extract 32- or 64-bit words from anywhere in a 128-bit register.
> I thought it'd be similar for 128-bit loads to 256-bit registers.
> However, it turns out there's a 3-cycle latency to access the upper
> 128-bit halves of 256-bit registers on Haswell, such as with
> _mm256_inserti128_si256() and _mm256_extracti128_si256() or
> _mm256_permute4x64_epi64(), and the word-extract instructions do not
> directly operate on the upper 128 bits. So it's 3 cycles to extract
> would-be S-box indices from there, and another 3 cycles to load the
> S-box lookup result in there. Oops.
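As a minimal sketch of the 64-bit two-half load pattern mentioned above (this is not the unreleased experiment itself; the function name is illustrative), in SSE2 intrinsics:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Gather two 64-bit S-box entries into one 128-bit register:
   _mm_loadl_epi64() fills the low 64 bits (zeroing the rest), and
   _mm_loadh_pi() fills the high 64 bits while preserving the low
   half.  The casts are free: they only reinterpret the register. */
static inline __m128i load_two_halves(const uint64_t *p0, const uint64_t *p1)
{
    __m128i x = _mm_loadl_epi64((const __m128i *)p0);
    return _mm_castps_si128(
        _mm_loadh_pi(_mm_castsi128_ps(x), (const __m64 *)p1));
}
```

There is no lane-crossing penalty here: both loads land directly in the addressed half of the 128-bit register.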

> Now, there may be hacks around some of this, such as by doing a
> 256-bit load from 16 bytes below the desired offset (thereby loading
> the lower 128 bits with garbage and the upper 128 bits with what's
> needed) and then loading the lower 128 bits with what's needed there.
> I've tried this, and it works, but at least without a change to data
> layout to make those 256-bit loads always (rather than half of the
> time) 256-bit aligned, the performance is no better than with
> _mm256_inserti128_si256(). Also, when implementing at intrinsics
> level, I resorted to using a combination of _mm256_loadu_si256() and
> _mm256_blend_epi32(), since I guess _mm256_castsi128_si256() is not
> meant to be used on the left-hand side. When implementing in
> assembly, I think it can be two loads (no need for a blend
> instruction), but I doubt it'd be any faster (same instruction count,
> and blend is fast). And I doubt the alignment would make all the
> difference, as I think Haswell has fast unaligned loads from cache
> (we are occasionally crossing a cache line, though).
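A sketch of that intrinsics-level hack (assuming GCC/Clang and an AVX2-capable CPU; names and test values are illustrative, not from the attached patch): the function below builds the same 256-bit register two ways, via _mm256_inserti128_si256() and via the load-16-bytes-below-plus-blend trick, and reports whether they match:

```c
#include <immintrin.h>
#include <stdint.h>

/* Two ways of getting 16 bytes of S-box data into the UPPER 128 bits
   of a 256-bit register while keeping the lower 128 bits intact:
   (a) _mm256_inserti128_si256() - a lane-crossing insert;
   (b) one unaligned 256-bit load starting 16 bytes below the wanted
       offset (low lane = garbage, high lane = wanted data), then
       _mm256_blend_epi32() to restore the low lane.
   Returns 1 if both produce identical registers.
   Note: (b) reads 16 bytes below the wanted offset, so that memory
   must be valid - here it is, since we read from inside the array. */
__attribute__((target("avx2")))
static int blend_trick_matches_insert(void)
{
    __attribute__((aligned(32))) uint32_t sbox[16];
    for (int i = 0; i < 16; i++)
        sbox[i] = 0x1000u + (uint32_t)i;

    __m256i low = _mm256_set1_epi32(0x5A5A5A5A);  /* existing low-lane data */
    __m128i wanted = _mm_load_si128((const __m128i *)&sbox[8]);

    /* (a) lane-crossing insert into the upper half */
    __m256i a = _mm256_inserti128_si256(low, wanted, 1);

    /* (b) bytes 16..31 of this load are sbox[8..11], i.e. 'wanted' */
    __m256i wide = _mm256_loadu_si256((const __m256i *)&sbox[4]);
    __m256i b = _mm256_blend_epi32(wide, low, 0x0F);  /* low 4 dwords from 'low' */

    return _mm256_movemask_epi8(_mm256_cmpeq_epi32(a, b)) == -1;
}
```

The blend itself is cheap (it executes in-lane); the cost discussed above comes from the unaligned 256-bit load occasionally crossing a cache line.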

> Samuel - any thoughts on accessing the upper 128 bits faster?

> So far, the best I achieved when going from AVX to AVX2 yet insisting
> on 128-bit S-box lookups is:

> AVX:
> Benchmarking 1 thread ...
> 930 c/s real, 930 c/s virtual (1023 hashes in 1.10 seconds)
> Benchmarking 8 threads ...
> 3410 c/s real, 433 c/s virtual (7161 hashes in 2.10 seconds)

> AVX2:
> Benchmarking 1 thread ...
> 710 c/s real, 710 c/s virtual (1023 hashes in 1.44 seconds)
> Benchmarking 8 threads ...
> 3580 c/s real, 458 c/s virtual (7161 hashes in 2.00 seconds)

> at 2 MB per hash on i7-4770K, with otherwise currently default
> yescrypt settings (PWXrounds=6, etc.) As you can see, single-thread
> performance went down by 24%, and 8-thread went up only by 5%.

> It's good news that attackers probably also can't use Haswell to
> attack yescrypt hashes tuned for older CPUs (with 128-bit S-box
> lookups) much faster. It's bad news that defenders can't benefit
> from Haswell either.

> It looks like I might introduce separate pwxform settings for AVX2,
> using 256-bit S-box lookups. Maybe I'll try 512-bit as well, and see
> if it becomes much weaker than bcrypt on Haswell or not (maybe target
> L2 cache with the S-boxes, then). (256-bit would definitely be fine
> in terms of bcrypt-like anti-GPU on Haswell.) If it's fine, then
> 512-bit would be preferable in order to avoid introducing yet another
> combination of settings for MIC and AVX-512. Note: all of this fits
> into yescrypt and pwxform as currently specified. We're only talking
> of adjusting the PWX* compile-time settings (and possibly making them
> runtime tunable), deviating from the current defaults, and producing
> optimized code for those (the -ref will run fine with only #define's
> changed).
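For concreteness, the PWX* settings being discussed are plain compile-time macros in the reference code. A hypothetical 256-bit-lane build (the alternative values below are illustrative, not a proposed default) would change something like:

```c
/* Current yescrypt defaults: 128-bit gather lanes */
#define PWXsimple 2   /* 2 x 64-bit words per lane = 128-bit S-box lookups */
#define PWXgather 4   /* 4 lanes gathered per 64-byte pwxform block */
#define PWXrounds 6
#define Swidth    8   /* 2^8 S-box entries */

/* Hypothetical AVX2 tuning: 256-bit lanes, same 64-byte pwxform block */
/* #define PWXsimple 4 */
/* #define PWXgather 2 */
```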

> OTOH, we may be satisfied with the current defaults as being a good
> fit for up to x86/AVX and ARM/NEON, and appearing to defeat attacks
> with newer CPUs so far. With these settings, Haswell performs almost
> like Sandy/Ivy Bridge do - no worse, but also no better. Maybe
> that's fine.

> I've attached a patch containing my experiments, including several
> AVX2 code versions of PWXFORM_SIMD. All of those use 128-bit S-boxes
> so far.

> Alexander

Just to share some experience we had on the AVX2 matter: last year, an undergrad (a student of a colleague of mine, from a different university) implemented Lyra2 taking advantage of AVX2 and got a 30% speedup with his implementation.

The results were reported in Portuguese (see the "30%" figure at the end of page 8), and the specification is not the current one (although similar speedups are likely to apply to the current version, since most optimizations applied to Blake), so I did not mention it earlier. But thinking back, we should try to reproduce his results... One more task on our TODO list... :)



