Message-ID: <CAOLP8p6rokF4Y+xfwGF5ZPty=3RLnMqOoR4E0kYLBYcPhn371w@mail.gmail.com>
Date: Sat, 22 Feb 2014 08:58:33 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] avoiding cache thrashing

On Fri, Feb 21, 2014 at 8:59 PM, Solar Designer <solar@...nwall.com> wrote:
> Bill,
>
> On Thu, Feb 20, 2014 at 08:32:14PM -0500, Bill Cox wrote:
>> This is some pretty mind blowing cache optimization. I'm still trying
>> to get my head around it.
>
> I gave it a try, and it's not providing as much benefit as I had hoped
> it would, on the Sandy Bridge machine I tested on. The good news is
> that this is because my code on SB is somehow not impacted too badly
> even when I read the entire L2 without this trick. Maybe the trick will
> be more relevant on other CPUs or/and with tighter placed load and
> compute instructions (and with little room for out-of-order), but I have
> no time for such testing right now.
>
> Here's some curious info I found in the process:
>
> http://www.7-cpu.com and in particular:
> http://www.7-cpu.com/cpu/SandyBridge.html
> http://www.7-cpu.com/cpu/IvyBridge.html
> http://www.7-cpu.com/cpu/Haswell.html
>
> http://www.realworldtech.com/haswell-cpu/5/
>
> http://www.agner.org/optimize/blog/read.php?i=165
> http://software.intel.com/en-us/forums/topic/280663

Thanks for these links! I can confirm the accuracy of the Ivy Bridge
numbers he reports. I also found this page useful:
http://www.realworldtech.com/haswell-cpu/5/

For small L1-sized blocks with high repeat counts and 2 memory-hashing
threads, I'm getting very close to his reported max bandwidth for
16-byte-wide accesses. Haswell doubles the bus width to L1 cache, to 32
bytes, which is why I've been using 8 32-bit lanes rather than 4 32-bit
lanes in my hashing. This doubles my minimum sub-block read size,
hurting defense against GPU attacks. Is 32 bytes small enough?
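
To make the lane-width question concrete, here's a rough sketch with
Intel intrinsics. The add/multiply step and the names are made up for
illustration, not my actual hashing code; the point is just that the
8-lane version turns each random sub-block read into a single 32-byte
load (Haswell's L1 load width), while the 4-lane version reads 16 bytes
(Sandy/Ivy Bridge's L1 load width):

#include <immintrin.h>
#include <stdint.h>

/* 8 x 32-bit lanes (AVX2): one random sub-block read is a single
 * 32-byte load, matching Haswell's L1 load width. */
static inline __m256i mix8(__m256i state, const uint32_t *subBlock)
{
    __m256i m = _mm256_loadu_si256((const __m256i *)subBlock);
    state = _mm256_add_epi32(state, m);
    state = _mm256_mullo_epi32(state,
                               _mm256_or_si256(m, _mm256_set1_epi32(3)));
    return state;
}

/* 4 x 32-bit lanes (SSE4.1): the same step with a 16-byte sub-block
 * read, matching the Sandy/Ivy Bridge L1 load width. */
static inline __m128i mix4(__m128i state, const uint32_t *subBlock)
{
    __m128i m = _mm_loadu_si128((const __m128i *)subBlock);
    state = _mm_add_epi32(state, m);
    state = _mm_mullo_epi32(state, _mm_or_si128(m, _mm_set1_epi32(3)));
    return state;
}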

If I make my random read accesses 64 bytes instead of 32, I can take
advantage of a future 64-byte-wide L1 access path, getting double the
L1 memory bandwidth. Is it better to plan for that future and set the
minimum sub-block size to 64 bytes, or is that just too big for GPU
defense?
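
As a sketch of what the 64-byte option costs today (again hypothetical,
not my actual code): on AVX2 hardware a 64-byte sub-block read is just
two back-to-back 32-byte loads, and a future CPU with a 64-byte path to
L1 could do it as one load:

#include <immintrin.h>
#include <stdint.h>

/* Read one 64-byte sub-block as two 32-byte loads. If subBlock is
 * 64-byte aligned, both halves sit in the same cache line, so the
 * random access still costs a single line fill. */
static inline void readSubBlock64(const uint32_t *subBlock,
                                  __m256i *lo, __m256i *hi)
{
    *lo = _mm256_load_si256((const __m256i *)subBlock);
    *hi = _mm256_load_si256((const __m256i *)(subBlock + 8));
}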

I haven't played with this idea yet, but what if I did an unpredictable
shuffle on each group of 4 32-bit lanes, with a total of 16 rather than
8 lanes? Would that provide better bandwidth for future CPUs with
64-byte-wide buses to L1 cache, while frustrating GPU attacks as well
as code doing 4-byte random reads?
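
Something like this is what I have in mind, as a sketch (the function
name is made up, and in practice the shuffle control would be derived
from previously hashed data so it's unpredictable). VPERMILPS with a
variable control (_mm256_permutevar_ps) permutes 32-bit lanes within
each 128-bit group of 4, so the 16 lanes fit in two AVX registers today
and could be one 64-byte register later; a single shared control is
shown here just for brevity:

#include <immintrin.h>

/* Data-dependent shuffle of each group of 4 32-bit lanes, over 16
 * lanes held in two 256-bit registers. The low 2 bits of each 32-bit
 * element of ctrl pick the source lane within its own group of 4. */
static inline void shuffle16(__m256i *s0, __m256i *s1, __m256i ctrl)
{
    *s0 = _mm256_castps_si256(
              _mm256_permutevar_ps(_mm256_castsi256_ps(*s0), ctrl));
    *s1 = _mm256_castps_si256(
              _mm256_permutevar_ps(_mm256_castsi256_ps(*s1), ctrl));
}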

Bill