Message-ID: <CAOLP8p6rokF4Y+xfwGF5ZPty=3RLnMqOoR4E0kYLBYcPhn371w@mail.gmail.com>
Date: Sat, 22 Feb 2014 08:58:33 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] avoiding cache thrashing

On Fri, Feb 21, 2014 at 8:59 PM, Solar Designer <solar@...nwall.com> wrote:
> Bill,
>
> On Thu, Feb 20, 2014 at 08:32:14PM -0500, Bill Cox wrote:
>> This is some pretty mind blowing cache optimization. I'm still trying
>> to get my head around it.
>
> I gave it a try, and it's not providing as much benefit as I had hoped
> it would, on the Sandy Bridge machine I tested on. The good news is
> that this is because my code on SB is somehow not impacted too badly
> even when I read the entire L2 without this trick. Maybe the trick will
> be more relevant on other CPUs or/and with tighter placed load and
> compute instructions (and with little room for out-of-order), but I have
> no time for such testing right now.
>
> Here's some curious info I found in the process:
>
> http://www.7-cpu.com and in particular:
> http://www.7-cpu.com/cpu/SandyBridge.html
> http://www.7-cpu.com/cpu/IvyBridge.html
> http://www.7-cpu.com/cpu/Haswell.html
>
> http://www.realworldtech.com/haswell-cpu/5/
>
> http://www.agner.org/optimize/blog/read.php?i=165
> http://software.intel.com/en-us/forums/topic/280663

Thanks for these links! I can confirm the accuracy of the Ivy Bridge
numbers he reports. I also found this page useful:
http://www.realworldtech.com/haswell-cpu/5/

For small L1-sized blocks with high repeat counts and 2 memory-hashing
threads, I'm getting very close to his reported max bandwidth for
16-byte-wide accesses. Haswell doubles the bus width to L1 cache, to 32
bytes, which is why I've been using 8 32-bit lanes rather than 4 32-bit
lanes in my hashing. This doubles my minimum sub-block read size,
hurting defense against GPU attacks. Is 32 bytes small enough?
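
To make the lane-width question concrete, here's a rough sketch with
Intel intrinsics. The add/multiply step and the names are made up for
illustration, not my actual hashing code; the point is just that the
8-lane version turns each random sub-block read into a single 32-byte
load (Haswell's L1 load width), while the 4-lane version reads 16 bytes
(Sandy/Ivy Bridge's L1 load width):

#include <immintrin.h>
#include <stdint.h>

/* 8 x 32-bit lanes (AVX2): one random sub-block read is a single
 * 32-byte load, matching Haswell's L1 load width. */
static inline __m256i mix8(__m256i state, const uint32_t *subBlock)
{
    __m256i m = _mm256_loadu_si256((const __m256i *)subBlock);
    state = _mm256_add_epi32(state, m);
    state = _mm256_mullo_epi32(state,
                               _mm256_or_si256(m, _mm256_set1_epi32(3)));
    return state;
}

/* 4 x 32-bit lanes (SSE4.1): the same step with a 16-byte sub-block
 * read, matching the Sandy/Ivy Bridge L1 load width. */
static inline __m128i mix4(__m128i state, const uint32_t *subBlock)
{
    __m128i m = _mm_loadu_si128((const __m128i *)subBlock);
    state = _mm_add_epi32(state, m);
    state = _mm_mullo_epi32(state, _mm_or_si128(m, _mm_set1_epi32(3)));
    return state;
}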

If I make my random read accesses 64 bytes instead of 32, I can take
advantage of a future 64-byte-wide L1 access path, getting double the
L1 memory bandwidth. Is it better to plan for that future and set the
minimum sub-block size to 64 bytes, or is that just too big for GPU
defense?
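
As a sketch of what the 64-byte option costs today (again hypothetical,
not my actual code): on AVX2 hardware a 64-byte sub-block read is just
two back-to-back 32-byte loads, and a future CPU with a 64-byte path to
L1 could do it as one load:

#include <immintrin.h>
#include <stdint.h>

/* Read one 64-byte sub-block as two 32-byte loads. If subBlock is
 * 64-byte aligned, both halves sit in the same cache line, so the
 * random access still costs a single line fill. */
static inline void readSubBlock64(const uint32_t *subBlock,
                                  __m256i *lo, __m256i *hi)
{
    *lo = _mm256_load_si256((const __m256i *)subBlock);
    *hi = _mm256_load_si256((const __m256i *)(subBlock + 8));
}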

I haven't played with this idea yet, but what if I did an unpredictable
shuffle on each group of 4 32-bit lanes, with a total of 16 rather than
8 lanes? Would that provide better bandwidth for future CPUs with
64-byte-wide buses to L1 cache, while frustrating GPU attacks as well
as code doing 4-byte random reads?
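
Something like this is what I have in mind, as a sketch (the function
name is made up, and in practice the shuffle control would be derived
from previously hashed data so it's unpredictable). VPERMILPS with a
variable control (_mm256_permutevar_ps) permutes 32-bit lanes within
each 128-bit group of 4, so the 16 lanes fit in two AVX registers today
and could be one 64-byte register later; a single shared control is
shown here just for brevity:

#include <immintrin.h>

/* Data-dependent shuffle of each group of 4 32-bit lanes, over 16
 * lanes held in two 256-bit registers. The low 2 bits of each 32-bit
 * element of ctrl pick the source lane within its own group of 4. */
static inline void shuffle16(__m256i *s0, __m256i *s1, __m256i ctrl)
{
    *s0 = _mm256_castps_si256(
              _mm256_permutevar_ps(_mm256_castsi256_ps(*s0), ctrl));
    *s1 = _mm256_castps_si256(
              _mm256_permutevar_ps(_mm256_castsi256_ps(*s1), ctrl));
}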

Bill