Date: Thu, 13 Feb 2014 23:00:11 -0500
From: Bill Cox <>
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Thu, Feb 13, 2014 at 6:51 PM, Solar Designer <> wrote:
> On Thu, Feb 13, 2014 at 05:36:26PM -0500, Bill Cox wrote:
>> I was surprised that (value & mask) caused my Ivy
>> Bridge CPU so much grief.  I thought 3 cycles would be enough to do
>> the mask operation and load from L1 cache.  Apparently not...
> I am also surprised.  Are you sure your "mask" is small enough that this
> fits in L1 cache?  Are you sure the data was already in L1 cache?

This is a big deal, so I wrote a simple loop to test it:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define memlen ((1u << 31)/sizeof(uint32_t))
#define blocklen (4096/sizeof(uint32_t))
#define numblocks (memlen/blocklen)

int main() {
    uint32_t *mem = malloc(memlen*sizeof(uint32_t));
    uint32_t i;
    for(i = 0; i < blocklen; i++) {
        mem[i] = i;
    }
    uint32_t value = 1;
    uint32_t mask = blocklen - 1;
    uint32_t *prev = mem;
    uint32_t *to = mem + blocklen;
    for(i = 1; i < numblocks; i++) {
        uint32_t j;
        for(j = 0; j < blocklen; j++) {
            uint32_t *from = mem + (value & mask);
            //uint32_t *from = mem + j;
            value = (value * (*prev++ | 3)) + *from;
            *to++ = value;
        }
    }
    printf("%u\n", mem[rand() % memlen]);
    return 0;
}

I compile this on my Ivy Bridge processor with:

gcc -O3 -std=c11 -mavx foo.c

Here's the run with simple addressing within a block:

noelkdf> time a.out

real    0m0.728s
user    0m0.620s
sys     0m0.100s

Here's the timing with value used to compute the next from address:

time a.out

real    0m1.292s
user    0m1.210s
sys     0m0.080s

This was on my 3.4GHz quad-core Ivy Bridge processor.  I also ran it
at work on our 128GB machine with 8 AMD processors; I'm not sure
which model.  Here's the timing with simple addressing indexed by the
loop counter:

bill> !time
time ./a.out

real    0m1.368s
user    0m0.672s
sys     0m0.696s

Here's using value to compute the address:

time ./a.out

real    0m1.951s
user    0m1.152s
sys     0m0.800s

I can't explain why using value to compute the next address is taking
longer.  It doesn't make sense to me.  Do you see any problems in my
test code?

