Message-ID: <20140119100713.GA18640@openwall.com>
Date: Sun, 19 Jan 2014 14:07:13 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] Question about saturating the memory bandwidth

On Sun, Jan 19, 2014 at 01:13:33PM +0400, Solar Designer wrote:
> With Salsa20/2, I am getting 9300 c/s:
>
> Benchmarking 1 thread ...
> 845 c/s real, 852 c/s virtual
> Benchmarking 32 threads ...
> 9300 c/s real, 292 c/s virtual
Without ROM (all reads are from RAM), also 1.75 MiB RAM:
Benchmarking 1 thread ...
1248 c/s real, 1255 c/s virtual
Benchmarking 32 threads ...
15429 c/s real, 484 c/s virtual
15429*1.75*4*2^20/10^9 = ~113 GB/s
L3 cache helps much more here (we have 1.75*32 = 56 MiB of data).
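For reference, the GB/s figures here are just back-of-the-envelope: c/s
rate times the per-instance allocation times an assumed number of memory
operations per allocated byte (the "*4" discussed at the end).  A minimal
sketch of that arithmetic (illustration only, not escrypt code):

#include <stdio.h>

/* Rough sustained-bandwidth estimate from a benchmark result: c/s rate,
   per-instance allocation in MiB, and memory operations per allocated
   byte per computed hash. */
static double est_gbps(double cps, double mib, double ops_per_byte)
{
    return cps * mib * ops_per_byte * 1048576.0 / 1e9;
}

int main(void)
{
    printf("%.1f GB/s\n", est_gbps(15429, 1.75, 4));    /* ~113 GB/s */
    return 0;
}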
> or 4188 c/s with 3.5 MiB RAM:
>
> Benchmarking 1 thread ...
> 429 c/s real, 429 c/s virtual
> Benchmarking 32 threads ...
> 4188 c/s real, 132 c/s virtual
Benchmarking 1 thread ...
635 c/s real, 639 c/s virtual
Benchmarking 32 threads ...
4774 c/s real, 150 c/s virtual
4774*3.5*4*2^20/10^9 = ~70 GB/s
L3 cache is not of as much help here, as we're exceeding its size by a
larger factor.  On to the larger sizes:
> 7 MiB:
>
> Benchmarking 1 thread ...
> 214 c/s real, 212 c/s virtual
> Benchmarking 32 threads ...
> 1992 c/s real, 62 c/s virtual
Without ROM:
Benchmarking 1 thread ...
321 c/s real, 323 c/s virtual
Benchmarking 32 threads ...
1987 c/s real, 62 c/s virtual
Faster for 1 thread, but almost the same speed for 32 threads.
> 14 MiB:
>
> Benchmarking 1 thread ...
> 101 c/s real, 101 c/s virtual
> Benchmarking 32 threads ...
> 952 c/s real, 30 c/s virtual
Benchmarking 1 thread ...
160 c/s real, 160 c/s virtual
Benchmarking 32 threads ...
892 c/s real, 28 c/s virtual
Now it's even slower for 32 threads. Why? I think it's page size. We
had the ROM in SysV shm on 2 MB pages (explicitly requested), whereas
the RAM is allocated with mmap() without an explicit page size request.
I guess the kernel kept it on 4 KB pages here. Thus, re-pointing half
the reads from ROM on 2 MB pages to RAM on 4 KB pages may slow them
down, given that the data does not fit in L3 cache (by far) anyway.
This means that when we're using a few MB or more as RAM, we can already
benefit from 2 MB pages.
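For completeness, here's roughly what an explicit 2 MB page request looks
like on Linux.  This is only a sketch of the approach (not the actual
escrypt allocation code); it assumes huge pages have been reserved
(e.g. via vm.nr_hugepages) and falls back to normal pages otherwise:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Try to back the allocation with 2 MB pages (MAP_HUGETLB); fall back to
   regular 4 KB pages (or transparent huge pages, if the kernel applies
   them) when no huge pages are available. */
static void *alloc_maybe_huge(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

For SysV shm, the corresponding flag is SHM_HUGETLB to shmget().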
> To fit in L3 cache, here's 896 KiB (7/8 of a MiB):
>
> Benchmarking 1 thread ...
> 1637 c/s real, 1650 c/s virtual
> Benchmarking 32 threads ...
> 22250 c/s real, 701 c/s virtual
>
> 22250*7/8*4*2^20/10^9 = ~81.7 GB/s
>
> This uses 28 MiB out of 32 MiB L3 cache, but there's possibly some cache
> thrashing by the reads from ROM (even though they're non-temporal, with
> the hint). Let's try 448 KiB (7/16 of a MiB):
>
> Benchmarking 1 thread ...
> 3102 c/s real, 3102 c/s virtual
> Benchmarking 32 threads ...
> 57406 c/s real, 1803 c/s virtual
>
> 57406*7/16*4*2^20/10^9 = ~105 GB/s
Same as above (first 896 KiB, then 448 KiB), but without ROM:
Benchmarking 1 thread ...
2380 c/s real, 2408 c/s virtual
Benchmarking 32 threads ...
46534 c/s real, 1462 c/s virtual
46534*7/8*4*2^20/10^9 = ~171 GB/s

Benchmarking 1 thread ...
4403 c/s real, 4427 c/s virtual
Benchmarking 32 threads ...
83581 c/s real, 2628 c/s virtual
83581*7/16*4*2^20/10^9 = ~153 GB/s
The efficiency loss with smaller memory size is puzzling. At these high
speeds, some kind of overhead probably starts to play more of a role.
Finally, with 1 MiB per instance to exactly match L3 cache size (r=8, so
1 KiB blocks):
Benchmarking 1 thread ...
2078 c/s real, 2089 c/s virtual
Benchmarking 32 threads ...
40678 c/s real, 1275 c/s virtual
40678*4*2^20/10^9 = ~171 GB/s
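(For reference, assuming the scrypt-style layout where a block is 128*r
bytes: r=8 gives 128*8 = 1024-byte blocks, so 1 MiB per instance
corresponds to N = 1024 such blocks.)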
Oh, and with only 1 round of Salsa20:
Benchmarking 1 thread ...
3078 c/s real, 3102 c/s virtual
Benchmarking 32 threads ...
55840 c/s real, 1742 c/s virtual
55840*4*2^20/10^9 = ~234 GB/s
Somehow r=4 is faster in this case (same total allocation size):
Benchmarking 1 thread ...
3125 c/s real, 3150 c/s virtual
Benchmarking 32 threads ...
59062 c/s real, 1846 c/s virtual
59062*4*2^20/10^9 = ~248 GB/s
As a reminder, for sequential reads from L3 cache (with no processing at
all) I am getting ~400 GB/s.
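For anyone who wants to reproduce that sort of number, here's a rough
sketch of such a read loop (using OpenMP for brevity; not my actual
benchmark code, and the per-thread buffer size and pass count here are
arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Each thread repeatedly sums a buffer small enough to stay resident in
   its share of L3 cache.  The checksum keeps the reads from being
   optimized out. */
int main(void)
{
    const size_t bytes = 7 * 1024 * 1024 / 8;   /* 7/8 MiB per thread */
    const size_t words = bytes / sizeof(uint64_t);
    const int passes = 20000;
    int nthreads = omp_get_max_threads();
    uint64_t total = 0;
    double t0 = omp_get_wtime();

#pragma omp parallel reduction(+:total)
    {
        uint64_t *buf = malloc(bytes);
        uint64_t sum = 0;
        for (size_t i = 0; i < words; i++)
            buf[i] = i;                 /* touch the pages, warm the cache */
        for (int p = 0; p < passes; p++)
            for (size_t i = 0; i < words; i++)
                sum += buf[i];
        total += sum;
        free(buf);
    }

    double dt = omp_get_wtime() - t0;
    printf("checksum %llu, ~%.0f GB/s\n", (unsigned long long)total,
        (double)nthreads * passes * bytes / dt / 1e9);
    return 0;
}

(Compile with something like gcc -O3 -fopenmp.)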
In all of the above tests, the 32 threads are independent. They're not
hashing the same password, unlike in my tests with a 2 GiB RAM
allocation that I had posted before. Those other tests were for KDF
use (to be directly comparable to Bill's). These high c/s rate tests
are for password hashing use, hence I expect the parallelism to come
from concurrent authentication attempts.
The "*4" factor can be reduced to "*3" or "*2" (with different escrypt
flags), resulting in higher memory usage for same c/s rate, but with
lower or with no TMTO resistance.
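As a rough illustration (assuming the factor translates directly into the
bandwidth estimates above): at the 14 MiB point, 892*14*4*2^20/10^9 =
~52 GB/s; with the "*2" setting, the same ~52 GB/s budget would support
roughly 28 MiB per hash at an unchanged c/s rate.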
Alexander