phc-discussions - optimizing memory access speed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20131231074331.GA21335@openwall.com>
Date: Tue, 31 Dec 2013 11:43:32 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: optimizing memory access speed

Hi,

Here are a few things I bookmarked while working on escrypt:

Software prefetching, non-temporal hints, and line fill buffers:

http://www.roguewave.com/portals/0/products/threadspotter/docs/2010.4/manual/ch05s03.html
http://stackoverflow.com/questions/19472036/does-software-prefetching-allocate-a-line-fill-buffer-lfb

False aliasing:

http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops?lq=1
http://stackoverflow.com/questions/9515482/performance-advantages-of-powers-of-2-sized-data

False aliasing may be triggered by scrypt's alignment of the buffers.
A comment in crypto_scrypt-sse.c says:

/**
 * smix(B, r, N, V, XY):
 * Compute B = SMix_r(B, N).  The input B must be 128r bytes in length;
 * the temporary storage V must be 128rN bytes in length; the temporary
 * storage XY must be 256r + 64 bytes in length.  The value N must be a
 * power of 2 greater than 1.  The arrays B, V, and XY must be aligned to a
 * multiple of 64 bytes.
 */

So all of B, V, and XY are cache line aligned, which ensures individual
accesses don't cross cache line boundaries (and thus don't bring in
extra cache lines unnecessarily, except possibly via a hardware
prefetcher), let alone page boundaries (although V elements, being 128*r
in size, may cross page boundaries, depending on the value of r).

Unfortunately, this also means that nearly simultaneous accesses to the
same cache bank are likely.  Besides cache bank conflicts, also possible
is false aliasing in store buffers.  I've briefly tried deliberately
misaligning B and/or XY, while keeping V aligned, but this did not make
a difference on the few builds and systems I tested this on.  Maybe we
got lucky and the problem does not show up for this code on current
systems, or maybe more testing is needed - or maybe we need to avoid
such alignment of multiple buffers being accessed by the same code, just
in case (if we deliberately misalign B and XY, the price for that is
just 2 cache lines).

I first learned of cache bank conflicts from Roman Rusakov when
optimizing DES for the original Pentium in 1996.  Roman brought to my
attention that Pentium could dual-issue two loads only if they were to
different cache banks, meaning different 32-bit word addresses modulo
cache line width (32 bytes for original Pentium; nowadays it's 64).
Indeed, moving half of the SPE tables by 4 bytes (leaving a 4 byte gap
between tables corresponding to lower and upper 32 bits) improved the
speed by ~10% (IIRC).  It was that simple at the time, with the in-order
CPU and a well-defined and easily described rule.  I think things are
much trickier for today's CPUs, yet the issue may still be seen and may
manifest itself in even worse ways (see those StackOverflow Q/As).

Alexander