Date: Thu, 13 Feb 2014 23:00:11 -0500
From: Bill Cox <>
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Thu, Feb 13, 2014 at 6:51 PM, Solar Designer <> wrote:
> On Thu, Feb 13, 2014 at 05:36:26PM -0500, Bill Cox wrote:
>> I was surprised that (value & mask) caused my Ivy
>> Bridge CPU so much grief.  I thought 3 cycles would be enough to do
>> the mask operation and load from L1 cache.  Apparently not...
> I am also surprised.  Are you sure your "mask" is small enough that this
> fits in L1 cache?  Are you sure the data was already in L1 cache?

This is a big deal, so I wrote a simple loop to test it:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define memlen ((1u << 31)/sizeof(uint32_t))
#define blocklen (4096/sizeof(uint32_t))
#define numblocks (memlen/blocklen)

int main() {
    uint32_t *mem = malloc(memlen*sizeof(uint32_t));
    uint32_t i;
    for(i = 0; i < blocklen; i++) {
        mem[i] = i;
    }
    uint32_t value = 1;
    uint32_t mask = blocklen - 1;
    uint32_t *prev = mem;
    uint32_t *to = mem + blocklen;
    for(i = 1; i < numblocks; i++) {
        uint32_t j;
        for(j = 0; j < blocklen; j++) {
            uint32_t *from = mem + (value & mask);
            //uint32_t *from = mem + j;
            value = (value * (*prev++ | 3)) + *from;
            *to++ = value;
        }
    }
    printf("%u\n", mem[rand() % memlen]);
    return 0;
}

I compile this on my Ivy Bridge processor with:

gcc -O3 -std=c11 -mavx foo.c

Here's the run with simple addressing within a block:

noelkdf> time a.out

real    0m0.728s
user    0m0.620s
sys     0m0.100s

Here's the timing with value used to compute the next from address:

time a.out

real    0m1.292s
user    0m1.210s
sys     0m0.080s

This was on my 3.4GHz quad-core Ivy Bridge processor.  I also ran it
at work on our 128GB machine with 8 AMD processors; I'm not sure
which model.  Here's the timing with simple addressing indexed by the
loop counter:

bill> !time
time ./a.out

real    0m1.368s
user    0m0.672s
sys     0m0.696s

Here's using value to compute the address:

time ./a.out

real    0m1.951s
user    0m1.152s
sys     0m0.800s

I can't explain why using value to compute the next address is taking
longer.  It doesn't make sense to me.  Do you see any problems in my
test code?

