phc-discussions - avoiding cache thrashing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140219142524.GA31253@openwall.com>
Date: Wed, 19 Feb 2014 18:25:24 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...sword-hashing.net
Subject: avoiding cache thrashing

Hi,

Here's another idea I'd like to share with this community.

When trying to access multiple layers in the memory subsystem at once -
e.g., L1 cache and L2 cache, or a cache and RAM - a problem is that our
accesses to the larger cache throw data out of the smaller cache, yet we
know we're going to need that data in the smaller cache again very soon.

For convenience, I'll be calling the smaller cache L1 and the larger
cache (or RAM) L2.

We wish we had some local memory instead of (or in addition to) L1
cache, but on general-purpose CPUs we don't.  We can't do anything about
this, or can we?

Yes, we can.  Cache thrashing occurs only when the L1 cache tag of a L2
access matches that of a possible L1 cache access.  So the workaround is
to avoid such L1 cache tag matches between the two types of accesses.
We can do that e.g. by resetting an address bit (in the range that is
part of L1 cache tags) in all L1 accesses, and forcing the same address
bit to 1 in all L2 accesses. (*)

A drawback of this workaround is that it effectively halves our usable
sizes of both caches.  What can we do about that?  Another workaround:
once in a while, alter our choice of address bit that we split the two
types of accesses on.  Each time we do this, we effectively swap a half
of the active and inactive cache halves (for both caches at once).  Of
course, when a previously inactive L1 cache half becomes half-active, it
has only a subset of its original L1'ish data (with the rest having been
thrashed with L2 accesses that we made while that half was inactive).
Thus, we shouldn't do such swapping very often, but we may nevertheless
occasionally do it to make some use of the other halves (and of the data
contained there).

A concern is what happens when multiple threads are running.  When each
thread has its own L1 cache, there's no problem.  However, when some
threads share a L1 cache - perhaps on a CPU with 2+ hardware threads per
core - we have a potential problem: L1 access tags of one thread may
happen to match L2 access tags (for L1 cache) of another thread sharing
the L1 cache.  Then we have L1 cache thrashing again.  To avoid it, we
need to keep the threads' use of categories of L1 cache tags in sync.
If we accept only using one half of each cache's capacity (total for all
threads sharing a L1 cache - thus, at most 1/4 of L1 cache size per
thread instead of at most 1/2 per thread, which would be full use), then
all we need to do is align all threads' memory allocations to cache set
boundary (which is typically achieved along with aligning to page
boundary) and refrain from altering our choice of the selector bit (just
hardcode it).

If we do insist on altering the choice (to more fully use caches'
capacity and the data contained at those other addresses), then we'd
need explicit thread synchronization, which is not always possible:
the multiple threads might not be part of one instance of our hash
computation - instead, they may be the caller's, or they may be part of
different processes or even different VMs happening to run multiple
instances of our software on the same machine.

Thus, there's a tradeoff: either we use half the caches' capacity only
or we likely incur L1 cache thrashing when multiple instances (beyond
the number of physical CPU cores) are run on a machine with multiple
hardware threads sharing a L1 cache.

(*) Since our L1 accesses are likely more frequent than L2 ones and since
we probably use an AND instruction anyway - to extract the index - it
is cheaper to reset the bit for L1 accesses and set it for L2 accesses,
rather than the other way around.  In fact, this same one AND instruction
may also be used to mask multiple indices at once, matching the S-box
count, if that's what we're using.  Thus, larger S-box count helps
amortize the cost of this AND in software, which is desirable since this
cost is fully avoided in ASIC.  The L1 index mask might be e.g.
0x0ef00ef00ef00ef0ULL for indexing 4 S-boxes with 128-bit elements, with
the e's instead of f's showing where our selector bit currently is.  The
lower bits are zero so that we obtain byte offsets right after applying
the mask, saving on the need for shifts or addressing modes with index
scaling.  16-bit (like in this example) or 32-bit sub-fields may be used
for easier extraction of the individual indices on some archs, in some
cases helping avoid shifts too.

Alexander