Date: Sat, 22 Feb 2014 08:58:33 -0500
From: Bill Cox
Subject: Re: [PHC] avoiding cache thrashing

On Fri, Feb 21, 2014 at 8:59 PM, Solar Designer wrote:
> Bill,
> On Thu, Feb 20, 2014 at 08:32:14PM -0500, Bill Cox wrote:
>> This is some pretty mind blowing cache optimization.  I'm still trying
>> to get my head around it.
> I gave it a try, and it's not providing as much benefit as I had hoped
> it would, on the Sandy Bridge machine I tested on.  The good news is
> that this is because my code on SB is somehow not impacted too badly
> even when I read the entire L2 without this trick.  Maybe the trick will
> be more relevant on other CPUs or/and with tighter placed load and
> compute instructions (and with little room for out-of-order), but I have
> no time for such testing right now.
> Here's some curious info I found in the process:
> and in particular:

Thanks for these links!  I can confirm the accuracy of the Ivy Bridge
numbers he reports.  I also found this page useful:

For small L1 sized blocks with high repeat counts and 2 memory hashing
threads, I'm getting very close to his reported max B/W for 16 byte
wide accesses.  Haswell has double the bus width to L1 cache, at 32
bytes wide, which is why I've been using 8 32-bit lanes rather than 4
32-bit lanes in my hashing.  This doubles the size of my minimum
sub-block read size, hurting defense against GPU attacks.  Is 32 bytes
small enough?

If I make my random read accesses 64 bytes instead of 32, I can take
advantage of future 64-byte wide L1 access, getting double the L1
memory bandwidth.  Is it better to plan for that future and raise the
minimum sub-block size to 64 bytes, or is this just too big for GPU
defense?

I haven't played with this idea yet, but what if I did an
unpredictable shuffle on each group of 4 32-bit lanes, with a total of
16 rather than 8 lanes?  Would that provide better bandwidth for
future CPUs with 64-byte wide busses to L1 cache, while frustrating
GPU attacks as well as code doing 4-byte random reads?
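
A minimal sketch of the shuffle idea above, in scalar C.  The post does
not say how the shuffle would be chosen, so everything here is an
assumption for illustration: the helper name shuffle_groups, the use of
a rotation as the per-group permutation, and taking two bits of a
data-dependent "selector" word per group.  Real code would presumably
use SIMD shuffles rather than a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16        /* 16 x 32-bit lanes = one 64-byte sub-block */
#define GROUP 4         /* lanes per shuffled group (16 bytes) */

/*
 * Hypothetical helper: rotate each group of 4 lanes by a data-dependent
 * amount.  Group g uses bits 2g and 2g+1 of `selector` as its rotation
 * count, so the lane order inside a group is not known in advance.
 * `in` and `out` must not be the same array.
 */
static void shuffle_groups(uint32_t out[LANES], const uint32_t in[LANES],
                           uint32_t selector)
{
    assert(out != in);
    for (size_t g = 0; g < LANES / GROUP; g++) {
        unsigned r = (selector >> (2 * g)) & 3u;
        for (size_t i = 0; i < GROUP; i++)
            out[g * GROUP + i] = in[g * GROUP + ((i + r) & 3u)];
    }
}
```

With this layout the whole 64-byte sub-block is still read as one unit,
while the order of the 4-byte words inside each 16-byte group depends on
previously hashed data.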

