Date: Fri, 6 Jul 2007 21:50:09 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To: "Li, Tong N" <tong.n.li@...el.com>
Cc: Andi Kleen <andi@...stfloor.org>, Andrew Morton <akpm@...ux-foundation.org>,
    Alexey Dobriyan <adobriyan@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [patch 10/10] *Tests* Scheduler profiling - Use immediate values

* Li, Tong N (tong.n.li@...el.com) wrote:
> > Also cache misses in this situation tend to be much more than 48 cycles
> > (even a K8 with integrated memory controller with the fastest DIMMs is
> > slower than that). Mathieu probably measured an L2 miss, not a load from
> > RAM. Load from RAM can be hundreds of ns in the worst case.
>
> The 48 cycles sounds to me like a memory load in an unloaded system, but
> it is quite low. I wonder how it was measured...
>
>   tong
>

Ah, excellent point. I just re-read my test case, and I modified it so
that it initializes the memory first, instead of using the zero page,
and so that it does a pseudo-random walk instead of going through the L2
cache lines sequentially. That will help, especially since the memory is
dual channel and the CPU does some prefetching. The code follows this
email for easier review and testing. I also wrote a test that
characterizes the speed impact of my LTTng marker set in both scenarios
under a memory-I/O-hungry program, also provided below.
* Memory hit for a data-based branch

Here are the results now, on my 3GHz Pentium 4:

number of tests : 100
number of branches per test : 100000
memory hit cycles per iteration (mean) : 636.611
L1 cache hit cycles per iteration (mean) : 89.6413
instruction stream based test, cycles per iteration (mean) : 85.3438
Just getting the pointer from a modulo on a pseudo-random value, doing
nothing with it, cycles per iteration (mean) : 77.5044

So:
Base case:                      77.50 cycles
instruction stream based test:  +7.8394 cycles
L1 cache hit based test:        +12.1369 cycles
Memory load based test:         +559.1066 cycles

So let's say we have a ping flood coming in at 7674 packets per second
(14014 packets transmitted, 14014 received, 0% packet loss, time
1826ms). If we put 2 markers for irq entry/exit, that brings us to 15348
marker sites executed per second.

(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029

We therefore have a 0.29% slowdown just in this case. Compared to this,
the instruction stream based test will cause a slowdown of:

(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004

i.e., a 0.004% slowdown. If we plan to use this for memory allocation,
spinlocks, and all sorts of very high event rate tracing, we can assume
it will execute 10 to 100 times more sites per second, which brings us
to a 0.4% slowdown for the instruction stream based test, compared to a
29% slowdown for the memory load based test, on a system with high
memory pressure.

* Markers impact under heavy memory load

Running a kernel with my LTTng instrumentation set, in a test that
generates memory pressure (from user-space) by thrashing the L1 and L2
caches between calls to getppid(). (Note: syscall_trace is active and
calls a marker upon syscall entry and syscall exit; the markers are
disarmed.) This test is done in user-space, so there are some delays due
to IRQs coming in and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task
with -20 nice level)

My first set of results, linear cache thrashing, turned out not to be
very interesting: it seems like the linearity of the memset on a full
array is somehow detected and it does not "really" thrash the caches.

Now the most interesting result: random-walk L1 and L2 thrashing
surrounding a getppid() call.

- Markers compiled out (but syscall_trace execution forced)

number of tests : 10000
No memory pressure
Reading timestamps takes 108.033 cycles
getppid : 1681.4 cycles
With memory pressure
Reading timestamps takes 102.938 cycles
getppid : 15691.6 cycles

- With the immediate values based markers:

number of tests : 10000
No memory pressure
Reading timestamps takes 108.006 cycles
getppid : 1681.84 cycles
With memory pressure
Reading timestamps takes 100.291 cycles
getppid : 11793 cycles

- With global variables based markers:

number of tests : 10000
No memory pressure
Reading timestamps takes 107.999 cycles
getppid : 1669.06 cycles
With memory pressure
Reading timestamps takes 102.839 cycles
getppid : 12535 cycles

One curious result is that the kernel is slower without markers than
with markers. I explain it by the fact that the data accessed is not
laid out in the same manner in the cache lines when the markers are
compiled in or out; it seems that compiling in the markers happens to
align the function's data better in this case.

The interesting comparison, though, is between the immediate values and
global variables based markers, because they share the same memory
layout, except for the movl being replaced by a movz. We see that the
global variable based markers (2 markers) add 742 cycles to each system
call (syscall entry and exit are traced, and the memory locations for
both global variables lie on the same cache line).
Therefore, not only is it interesting to use immediate values to
dynamically activate dormant code such as the markers, but I think they
should also be considered as a replacement for many of the "read mostly"
static variables.

Mathieu


Testing memory timings, in user-space:
----------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long long cycles_t;

#define barrier() __asm__ __volatile__("": : :"memory")

#define rdtsc(low,high) \
	__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))
#define rdtscl(low) \
	__asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")
#define rdtscll(val) \
	__asm__ __volatile__("rdtsc" : "=A" (val))

static inline cycles_t get_cycles(void)
{
	unsigned long long ret = 0;

	barrier();
	rdtscll(ret);
	barrier();
	return ret;
}

#define L1_CACHELINE_SIZE 64
#define L2_CACHELINE_SIZE 128
#define ARRAY_SIZE 524288000

#define NR_TESTS 100
#define NR_ITER 100000

char array[ARRAY_SIZE];
unsigned int glob = 0;

int main(int argc, char **argv)
{
	char *testval;
	cycles_t time1, time2;
	double cycles_per_iter;
	unsigned int i, j;

	srandom(get_cycles());
	printf("number of tests : %d\n", NR_TESTS);
	printf("number of branches per test : %d\n", NR_ITER);

	/* Initialize the memory so we do not touch the zero page. */
	memset(array, 0, ARRAY_SIZE);

	/* Branch condition loaded from a pseudo-random memory location. */
	cycles_per_iter = 0.0;
	for (i = 0; i < NR_TESTS; i++) {
		time1 = get_cycles();
		for (j = 0; j < NR_ITER; j++) {
			testval = &array[random() % ARRAY_SIZE];
			if (*testval)
				glob++;
		}
		time2 = get_cycles();
		cycles_per_iter += (time2 - time1)/(double)NR_ITER;
	}
	cycles_per_iter /= (double)NR_TESTS;
	printf("memory hit cycles per iteration (mean) : %g\n",
		cycles_per_iter);

	/* Branch condition loaded from a single L1 cache line. */
	cycles_per_iter = 0.0;
	for (i = 0; i < NR_TESTS; i++) {
		testval = &array[random() % L1_CACHELINE_SIZE];
		if (*testval)
			glob++;
		time1 = get_cycles();
		for (j = 0; j < NR_ITER; j++) {
			testval = &array[random() % L1_CACHELINE_SIZE];
			if (*testval)
				glob++;
		}
		time2 = get_cycles();
		cycles_per_iter += (time2 - time1)/(double)NR_ITER;
	}
	cycles_per_iter /= (double)NR_TESTS;
	printf("L1 cache hit cycles per iteration (mean) : %g\n",
		cycles_per_iter);

	/* Branch condition embedded in the instruction stream. */
	cycles_per_iter = 0.0;
	for (i = 0; i < NR_TESTS; i++) {
		time1 = get_cycles();
		for (j = 0; j < NR_ITER; j++) {
			char condition;

			asm (	".align 2\n\t"
				"0:\n\t"
				"movb %1,%0;\n\t"
				: "=r" (condition)
				: "i" (0));
			testval = &array[random() % ARRAY_SIZE];
			if (condition)
				glob++;
		}
		time2 = get_cycles();
		cycles_per_iter += (time2 - time1)/(double)NR_ITER;
	}
	cycles_per_iter /= (double)NR_TESTS;
	printf("instruction stream based test, cycles per iteration (mean) : %g\n",
		cycles_per_iter);

	/* Base case: compute the pointer, no branch at all. */
	cycles_per_iter = 0.0;
	for (i = 0; i < NR_TESTS; i++) {
		time1 = get_cycles();
		for (j = 0; j < NR_ITER; j++)
			testval = &array[random() % ARRAY_SIZE];
		time2 = get_cycles();
		cycles_per_iter += (time2 - time1)/(double)NR_ITER;
	}
	cycles_per_iter /= (double)NR_TESTS;
	printf("Just getting the pointer, doing nothing with it, cycles per iteration (mean) : %g\n",
		cycles_per_iter);

	return 0;
}
----------------------------------------------------------------------------

And here is the syscall timing under heavy memory I/O load: (gcc -O2)
----------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

typedef unsigned long long cycles_t;

#define barrier() __asm__ __volatile__("": : :"memory")

#define rdtsc(low,high) \
	__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high))
#define rdtscl(low) \
	__asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")
#define rdtscll(val) \
	__asm__ __volatile__("rdtsc" : "=A" (val))

static inline cycles_t get_cycles(void)
{
	unsigned long long ret = 0;

	barrier();
	rdtscll(ret);
	barrier();
	return ret;
}

#define L1_CACHELINE_SIZE 64
#define L2_CACHELINE_SIZE 128
#define ARRAY_SIZE 1048576	/* 1 MB, size of L2 cache */

char array[ARRAY_SIZE];

#define NR_TESTS 10000

int main(int argc, char **argv)
{
	cycles_t time1, time2;
	double cycles_per_iter;
	unsigned int i, j;
	pid_t pid;

	printf("number of tests : %d\n", NR_TESTS);
	srandom(get_cycles());

	printf("No memory pressure\n");

	/* Timestamp read overhead. */
	cycles_per_iter = 0.0;
	for (i = 0; i < NR_TESTS; i++) {
		time1 = get_cycles();
		time2 = get_cycles();
		cycles_per_iter += (time2 - time1);
	}
	cycles_per_iter /= (double)NR_TESTS;
	printf("Reading timestamps takes %g cycles\n", cycles_per_iter);

	/* Syscall cost with warm caches. */
	cycles_per_iter = 0.0;
	for (i = 0; i < NR_TESTS; i++) {
		time1 = get_cycles();
		pid = getppid();
		time2 = get_cycles();
		cycles_per_iter += (time2 - time1);
	}
	cycles_per_iter /= (double)NR_TESTS;
	printf("getppid : %g cycles\n", cycles_per_iter);

	printf("With memory pressure\n");

	cycles_per_iter = 0.0;
	for (i = 0; i < NR_TESTS; i++) {
		/* Random walk to really thrash the caches. */
		//memset(array, 1, ARRAY_SIZE);
		for (j = 0; j < ARRAY_SIZE*3; j++)
			array[random() % ARRAY_SIZE] = 1;
		time1 = get_cycles();
		time2 = get_cycles();
		cycles_per_iter += (time2 - time1);
	}
	cycles_per_iter /= (double)NR_TESTS;
	printf("Reading timestamps takes %g cycles\n", cycles_per_iter);

	/* Syscall cost with thrashed caches. */
	cycles_per_iter = 0.0;
	for (i = 0; i < NR_TESTS; i++) {
		//memset(array, 1, ARRAY_SIZE);
		for (j = 0; j < ARRAY_SIZE*3; j++)
			array[random() % ARRAY_SIZE] = 1;
		time1 = get_cycles();
		pid = getppid();
		time2 = get_cycles();
		cycles_per_iter += (time2 - time1);
	}
	cycles_per_iter /= (double)NR_TESTS;
	printf("getppid : %g cycles\n", cycles_per_iter);

	return 0;
}
----------------------------------------------------------------------------

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/