Message-ID: <CAOLP8p6rokF4Y+xfwGF5ZPty=3RLnMqOoR4E0kYLBYcPhn371w@mail.gmail.com>
Date: Sat, 22 Feb 2014 08:58:33 -0500
From: Bill Cox <waywardgeek@...il.com>
To: discussions@...sword-hashing.net
Subject: Re: [PHC] avoiding cache thrashing

On Fri, Feb 21, 2014 at 8:59 PM, Solar Designer <solar@...nwall.com> wrote:
> Bill,
>
> On Thu, Feb 20, 2014 at 08:32:14PM -0500, Bill Cox wrote:
>> This is some pretty mind blowing cache optimization. I'm still trying
>> to get my head around it.
>
> I gave it a try, and it's not providing as much benefit as I had hoped
> it would, on the Sandy Bridge machine I tested on. The good news is
> that this is because my code on SB is somehow not impacted too badly
> even when I read the entire L2 without this trick. Maybe the trick will
> be more relevant on other CPUs or/and with tighter placed load and
> compute instructions (and with little room for out-of-order), but I have
> no time for such testing right now.
>
> Here's some curious info I found in the process:
>
> http://www.7-cpu.com and in particular:
> http://www.7-cpu.com/cpu/SandyBridge.html
> http://www.7-cpu.com/cpu/IvyBridge.html
> http://www.7-cpu.com/cpu/Haswell.html
>
> http://www.realworldtech.com/haswell-cpu/5/
>
> http://www.agner.org/optimize/blog/read.php?i=165
> http://software.intel.com/en-us/forums/topic/280663

Thanks for these links! I can confirm the accuracy of the Ivy Bridge
numbers reported there. I also found this page useful:

http://www.realworldtech.com/haswell-cpu/5/

For small L1-sized blocks with high repeat counts and 2 memory hashing
threads, I'm getting very close to the reported max bandwidth for 16-byte
wide accesses.

Haswell has double the bus width to L1 cache, at 32 bytes wide, which is
why I've been using 8 32-bit lanes rather than 4 32-bit lanes in my
hashing. This doubles my minimum sub-block read size, hurting defense
against GPU attacks. Is 32 bytes small enough?

If I make my random read accesses 64 bytes instead of 32, I can take
advantage of future 64-byte-wide L1 accesses, getting double the L1
memory bandwidth. Is it better to plan for that future and set the
minimum sub-block size to 64 bytes, or is this just too big for GPU
defense?

I haven't played with this idea yet, but what if I did an unpredictable
shuffle on each group of 4 32-bit lanes, with a total of 16 rather than 8
lanes? Would that provide better bandwidth for future CPUs with 64-byte
wide buses to L1 cache, while frustrating GPU attacks as well as code
doing 4-byte random reads?

Bill
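As a concrete illustration of the 8-lane approach described above, here is a
minimal sketch of hashing one 32-byte sub-block per iteration with 8 x 32-bit
lanes via AVX2, so each random read matches Haswell's 32-byte-wide L1 load
port. This is not the actual code under discussion: the function name, the
multiply-add mixing step, and the randMask addressing scheme are placeholders
chosen for illustration; only the intrinsics themselves are real.

    /* Hypothetical sketch, compile with -mavx2.  mem holds randMask+1
     * sub-blocks of 32 bytes each (randMask is a power of two minus 1). */
    #include <immintrin.h>
    #include <stdint.h>

    void hash_blocks_avx2(uint32_t *mem, uint32_t *state,
                          uint32_t numSubBlocks, uint32_t randMask)
    {
        __m256i s = _mm256_loadu_si256((__m256i *)state);
        uint32_t addr = 0;
        for (uint32_t i = 0; i < numSubBlocks; i++) {
            /* Data-dependent address of the next 32-byte sub-block. */
            addr = (addr + (uint32_t)_mm_cvtsi128_si32(
                               _mm256_castsi256_si128(s))) & randMask;
            __m256i v = _mm256_loadu_si256(
                            (__m256i *)(mem + 8 * (uint64_t)addr));
            /* Placeholder mixing across 8 independent 32-bit lanes. */
            s = _mm256_add_epi32(_mm256_mullo_epi32(s, v), v);
        }
        _mm256_storeu_si256((__m256i *)state, s);
    }

Because each address is derived from the running state, the 32-byte reads are
chained and cannot be prefetched or batched, which is the property the
sub-block size question is about: an attacker reading at finer granularity
still has to fetch the whole 32-byte unit.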
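For the "unpredictable shuffle" question, here is one possible reading of the
idea, sketched in plain C for clarity rather than with intrinsics: the state
is 16 x 32-bit lanes (64 bytes, matching a hypothetical future 64-byte-wide
L1 load), viewed as four groups of 4 lanes, and each group gets a
data-dependent lane permutation after every mixing step. The permutation
table and the way selector bits are drawn from the state are invented for
illustration, not taken from any existing design.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical sketch: permute the 4 lanes inside each 4-lane group
     * by a selection derived from the state itself. */
    void shuffle_within_groups(uint32_t state[16])
    {
        static const uint8_t perms[4][4] = {
            {0, 1, 2, 3}, {1, 3, 0, 2}, {2, 0, 3, 1}, {3, 2, 1, 0}
        };
        uint32_t tmp[16];
        memcpy(tmp, state, sizeof(tmp));
        for (int g = 0; g < 4; g++) {
            /* Two selector bits per group, drawn from the running state. */
            const uint8_t *p = perms[(tmp[0] >> (2 * g)) & 3];
            for (int i = 0; i < 4; i++)
                state[4 * g + i] = tmp[4 * g + p[i]];
        }
    }

The intended effect is that code doing 4-byte random reads cannot know in
advance which lane holds which value, while a CPU with a 64-byte-wide path to
L1 still reads and writes the whole 16-lane state in one access.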