phc-discussions - Re: [PHC] GPU benchmarks: Lyra2+yescrypt

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20150508224512.GA25372@openwall.com>
Date: Sat, 9 May 2015 01:45:12 +0300
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] GPU benchmarks: Lyra2+yescrypt

Hi Marcos,

On Fri, May 08, 2015 at 06:10:10PM -0300, Marcos Simplicio wrote:
> > I find the testing methodology weird and wrong.  You're reporting
> > microseconds.  You should be reporting hashes per second rates 
> 
> OK, hashes/s is indeed a better metric for reporting this. Alone, this
> should not change the ratios shown in the 3rd column, though, so I do
> not agree that the end result is actually wrong. I agree that further
> tests can be made and *added* to the analysis, though, and that is a
> good thing (see below).

The CPU speed figures will change a lot, and this will change the 3rd
column.  Latency on under-loaded CPU can't be directly and reliably
converted to throughput.  We just don't have that data in your results.
(Also, there's a major issue with memory (de)allocation - see below.)

This conversion is valid for GPU, because you're not actually reporting
latency for it - you're already reporting 1/throughput.  I merely ask
that you do the same for CPU, and while at it report throughput directly
rather than via the reciprocals.

> Anyhow, the code was made available so anyone can improve the tests for
> different platforms (note: please try getting the code from the root
> folder, "https://github.com/leocalm/Lyra/", not from the sub-folder I
> sent earlier). We will also include our execution scripts there as soon
> as we clean them up.

Do you mean like:

$ git clone https://github.com/leocalm/Lyra/GPU_attacks
Cloning into 'GPU_attacks'...
error: The requested URL returned error: 403 while accessing https://github.com/leocalm/Lyra/GPU_attacks/info/refs
fatal: HTTP request failed
$ git clone https://github.com/leocalm/Lyra/GPU_attacks -b master GPU_attacks
Cloning into 'GPU_attacks'...
error: The requested URL returned error: 403 while accessing https://github.com/leocalm/Lyra/GPU_attacks/info/refs
fatal: HTTP request failed

> > for full device load, for both CPU and GPU.
> 
> For the CPU, I guess you are proposing we do something similar to what
> Milan did in Test9, right? If so, that sounds reasonable.

Yes, that's right.

For yescrypt, you can also use the included "userom" program for a
sanity check of your results.  You run it like this:

solar@...l:~/yescrypt/yescrypt-0.7.1$ ./userom 0 2
r=8 N=2^11 NROM=2^0
Will use 0.00 KiB ROM
         2048.00 KiB RAM
'$7X$96..../....WZaPV7LSUEKMo34.$IqbBioI2fvqrKNcxDpBdqysJN52oeaKhLk3EQM6t4o/'
Benchmarking 1 thread ...
938 c/s real, 938 c/s virtual (1023 hashes in 1.09 seconds)
Benchmarking 8 threads ...
3426 c/s real, 436 c/s virtual (7161 hashes in 2.09 seconds)

This excludes the memory (de)allocation overhead.  With the overhead
included, the numbers will be somewhat lower.  In fact, userom.c
includes that version, too - through changing "#if 1" on line 266 to an
"#if 0".  Wow, actually it's a lot slower at many threads then:

Benchmarking 8 threads ...
1278 c/s real, 207 c/s virtual (3069 hashes in 2.40 seconds)

I think I figured it out.  I pretty much only cared about either having
the memory (de)allocation out of the loop or KDF use so far.  So I
wanted the resulting allocation to be optimal (page size and alignment),
not the (de)allocation itself to be fast.  Thus, yescrypt-platform.c
prefers to use mmap(), which means there's always two syscalls, and
moreover the kernel zeroes this memory each time before handing it back
to the process.  Simply patching "#undef MAP_ANON" right before
alloc_region() in yescrypt-platform.c brings the speeds to:

Benchmarking 8 threads ...
2790 c/s real, 370 c/s virtual (3069 hashes in 1.10 seconds)

which is 81% of the original speed.  That's far more reasonable.

Forcing use of 2 MB pages (lowering HUGEPAGE_THRESHOLD to below 2 MB)
while staying with mmap() also helped, but not as much:

Benchmarking 8 threads ...
1606 c/s real, 266 c/s virtual (3069 hashes in 1.91 seconds)

Overall, at these sizes the memory (de)allocation overhead can be very
significant.  The 2790 c/s speed above is good, but it is unclear if
it'd be achieved in separate processes each doing just one password
hash, rather than in repeated use by the same process.  Probably not
(but the mmap() speed should be achievable, perhaps by actually using
mmap() like I did).  This isn't in any way specific to yescrypt; it's
just that this code highlighted it.

> Now, for the GPU, we have to disagree: (see
> http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy):
> 
> "Higher occupancy does not always equate to higher performance

Oh, that's just a misunderstanding.  By "full load" on the GPU, I didn't
mean utilizing it beyond the optimal point.  Of course, you need to tune
for optimal performance.  We assume that an attacker will.

To summarize: on CPU do run the maximum number of concurrent threads
even if that's sub-optimal, because this is the scenario that will limit
the cost settings on a server.  On GPU, use optimal settings - whatever
results in the highest throughput.

> Since we were looking for the best throughput, our scripts tried many
> different occupancies until the optimal point was found, just like we
> did for plotting "Figure 20" in Lyra2's Reference Guide. Indeed, 8
> threads per warp gives lower latency (hence, higher throughput) than the
> usual 32 threads per warp, for both algorithms, so the former is optimal
> for the attacker.

"lower latency (hence, higher throughput)" isn't necessarily right when
the number of concurrently computed hashes changes between your tests.

However, overall I agree - you need to tune for the highest throughput.
In fact, I think you should omit results that are not the highest.

> > In a similar spirit, I think we should lock p=1 for these tests, and add
> > the maximum amount of parallelism externally - much like it'd happen on
> > an authentication server or in an attack.  So e.g. on a quad-core Intel
> > CPU with HT, we should run 8 threads externally to both yescrypt and
> > Lyra2, both at p=1. 
> 
> That is a reasonable test.

I'm happy you agree.  I think it's the only test needed for these low
m_cost values.  Except that:

> > And keep the memory (de)allocation out of the loop,
> > if we can - or maybe report both kinds of benchmarks (with this overhead
> > included or excluded, as it can be either depending on how well a given
> > software integration or server deployment has been performed).
> 
> We will start with (de)allocations inside the loop, since that seems to
> have better correspondence with a higher memory usage scenario (and
> takes less effort for now :) )

OK, but it's the less reliable test - the speeds vary too much across
different memory (de)allocation methods and repeated invocations from
the same process vs. newly created processes.  You might be benchmarking
mostly this overhead, which is fundamentally the same for Lyra2 and
yescrypt (so does not help PHC select one or the other), but may be
currently very different because of current implementation detail.

> We will try to use the same approach Milan did in his benchmarks to have
> comparable results.

Sounds good.  From his description, I guess it'd correspond to my tests
above.  It's unclear which version of the memory (de)allocation we want
benchmarked, though.  Maybe several versions, and ensure that Lyra2 and
yescrypt use the same version in each comparison.

> For a sanity check, though: Figure 10 in Milan's benchmarks shows a
> throughput of ~1100 h/s for mcost = 1 MiB, tcost = min (so, yescrypt's
> T=0) and parallel_processes = 1, which is much closer to what our
> latency measurements suggest (~750 h/s) than to your numbers.

Right.  But it's for just one core in use.  Definitely not the case I
optimized for.

> My best
> guess is that most of the difference comes from the different platforms
> (hardware, OS, etc.),not necessarily from methodology. For example,
> considering only processor speed, ours is 2.2 GHz, Milan's is 2.1 GHz,
> and i7-4770K is 3.5 GHz.

I now think the difference must be primarily from the memory
(de)allocation, which is slower for yescrypt's use of mmap()/munmap()
vs. the competitors' use of malloc()/free() when either approach is used
repeatedly from the same process.  With the latter, memory is not
necessarily freed back to the OS, so is not necessarily zeroed when it
is reallocated by the same process when it computes the next hash.
Obviously, zeroing 2 MB got to have a cost.

As to clock rates, you're indeed right that they're very different, but
the difference is smaller than you think since we should be looking at
the relevant turbo rates.  For your E5-2430, it's 2.7 GHz when one core
is in use, vs. i7-4770K's 3.9 GHz.  3.9/2.7 = 1.44 vs. 3.5/2.2 = 1.59.
I think it's similar for turbo rates with multiple cores in use:

http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#Xeon_E5-24xx_.28dual-processor.29

This gives 3/3/4/4/5/5 for E5-2430 turbo rates depending on number of
cores in use.  For all cores in use, it'd be 2.2+0.3 = 2.5 GHz.  For
i7-4770K it's 3.7 GHz.  3.7/2.5 = 1.48, which is still slightly smaller
than the 1.59 estimate that we'd arrive at from base clock rates.

(FWIW, I actually measured these clock rates for i7-4770K, and verified
that Wikipedia table's data for some other Xeon E5 CPUs - they matched.
I don't have access to E5-2430 in particular.)

But these clock rate effects are smaller than the memory (de)allocation
effect.  I actually didn't expect it'd be _that_ bad.

BTW, I'd be happy to create you (your team) accounts on our systems, so
that you can benchmark both Lyra2 and yescrypt on a system where I can
directly reproduce and confirm your results.  This will eliminate
needing to consider clock rate differences and such.

Thanks,

Alexander