Immediate values are used as read-mostly variables that are rarely updated.
They use code patching to modify the values inscribed in the instruction
stream. This saves precious cache lines that would otherwise have to be used
by these variables.

There is a generic _imv_read() version, which uses standard global variables,
and optimized per-architecture imv_read() implementations, which use a load
immediate to remove a data cache hit. When the immediate values functionality
is disabled in the kernel, it falls back to global variables.

It adds a new rodata section "__imv" to place the pointers to the enable
value. The immediate values activation functions sit in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the state of the associated immediate values, and must be
accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. This is the case if the variable it refers to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.

* Why should this be merged *

It improves performance on heavy memory I/O workloads.

An interesting result shows the potential of this infrastructure: the
slowdown a simple system call such as getppid() suffers when it is used under
heavy user-space cache thrashing.

Random walk L1 and L2 thrashing surrounding a getppid() call:
(note: in this test, do_syscall_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure   : getppid() takes  1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a slowdown of about 10 times just to get the kernel
variables from memory. Another test on the same architecture (Intel P4)
measured the memory latency to be 559 cycles.
Therefore, each cache line removed from the hot path would improve the
syscall time by about 3.5% (559 of 15589 cycles) in these conditions.

Changelog:
- section __imv is already SHF_ALLOC
- Because of the wonders of ELF, section 0 has sh_addr and sh_size 0. So
  the if (immediateindex) is unnecessary here.
- Remove module_mutex usage: depend on functions implemented in module.c for
  that.
- Does not update tainted module's immediate values.
- remove imv_*_t types, add DECLARE_IMV() and DEFINE_IMV().
- imv_read(&var) becomes imv_read(var) because of this.
- Adding a new EXPORT_IMV_SYMBOL(_GPL).
- remove imv_if(). Should use if (unlikely(imv_read(var))) instead.
- Wait until we have gcc support before we add the imv_if macro, since
  its form may have to change.
- Don't declare the __imv section in vmlinux.lds.h, just put the content
  in the rodata section.
- Simplify interface: remove imv_set_early, keep track of kernel boot
  status internally.
- Remove the ALIGN(8) before the __imv section. It is packed now.
- Uses an IPI busy-loop on each CPU with interrupts disabled as a simple,
  architecture agnostic, update mechanism.
- Use imv_* instead of immediate_*.
- Updating immediate values cannot rely on smp_call_function(), because
  synchronizing cpus using IPIs leads to deadlocks. Process A held a read
  lock on tasklist_lock, then process B called apply_imv_update(). Process A
  received the IPI and began executing ipi_busy_loop(). Then process C took
  a write lock irq on the task list lock, before receiving the IPI. Thus,
  process A holds up process C, and C can't get an IPI because interrupts
  are disabled. Solve this problem by using a new 'ALL_CPUS' parameter to
  stop_machine_run(), which runs a function on all cpus after they are busy
  looping and have disabled irqs. Since this is done in a new process
  context, we don't have to worry about interrupted spin_locks. Also, fewer
  lines of code. Has survived 24 hours+ of testing...
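As a rough illustration of the interface described above (not part of the
patch itself), the generic fallback path can be exercised in user space. The
four macros below are copied from the generic definitions in
include/linux/immediate.h in this patch; trace_enabled, traced_calls,
traced_work() and enable_tracing() are hypothetical names chosen for the
example, and unlikely() is redefined locally through the GCC/Clang builtin.
With CONFIG_IMMEDIATE enabled, imv_read() would instead come from the
per-architecture header and imv_set() would also trigger the code patching.

```c
/*
 * User-space sketch of the generic (CONFIG_IMMEDIATE disabled) interface.
 * Only the four imv macros come from the patch; everything else is a
 * hypothetical example. Needs GCC or Clang (__typeof__, __builtin_expect).
 */
#include <assert.h>

#define DEFINE_IMV(type, name)	__typeof__(type) name##__imv
#define _imv_read(name)		(name##__imv)
#define imv_read(name)		_imv_read(name)
#define imv_set(name, i)	(name##__imv = (i))

/* Local stand-in for the kernel's unlikely() branch hint. */
#define unlikely(x)		__builtin_expect(!!(x), 0)

/* Hypothetical read-mostly flag, disabled by default. */
DEFINE_IMV(char, trace_enabled) = 0;

/* Hypothetical counter so the effect of the flag can be observed. */
static int traced_calls;

/* Hot path: the flag is read on every call but rarely changes. */
static void traced_work(void)
{
	if (unlikely(imv_read(trace_enabled)))
		traced_calls++;
}

/* Slow path: the rare update goes through imv_set(). */
static void enable_tracing(int on)
{
	imv_set(trace_enabled, on);
}
```

Calling traced_work() before enable_tracing(1) leaves traced_calls untouched;
calling it afterwards increments the counter. The form
if (unlikely(imv_read(var))) is the one the changelog recommends in place of
the removed imv_if().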

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Jason Baron
CC: Rusty Russell
CC: Adrian Bunk
CC: Andi Kleen
CC: Christoph Hellwig
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 include/asm-generic/vmlinux.lds.h |    3
 include/linux/immediate.h         |   94 +++++++++++++++++++++++
 include/linux/module.h            |   16 ++++
 init/main.c                       |    8 ++
 kernel/Makefile                   |    1
 kernel/immediate.c                |  149 ++++++++++++++++++++++++++++++++++++++
 kernel/module.c                   |   50 ++++++++++++
 7 files changed, 320 insertions(+), 1 deletion(-)

Index: linux-2.6-sched-devel/include/linux/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-sched-devel/include/linux/immediate.h	2008-04-16 11:14:29.000000000 -0400
@@ -0,0 +1,94 @@
+#ifndef _LINUX_IMMEDIATE_H
+#define _LINUX_IMMEDIATE_H
+
+/*
+ * Immediate values, can be updated at runtime and save cache lines.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifdef CONFIG_IMMEDIATE
+
+struct __imv {
+	unsigned long var;	/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	unsigned long imv;	/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	unsigned char size;	/* Type size. */
+} __attribute__ ((packed));
+
+#include
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i)						\
+	do {								\
+		name##__imv = (i);					\
+		core_imv_update();					\
+		module_imv_update();					\
+	} while (0)
+
+/*
+ * Internal update functions.
+ */
+extern void core_imv_update(void);
+extern void imv_update_range(const struct __imv *begin,
+	const struct __imv *end);
+
+#else
+
+/*
+ * Generic immediate values: a simple, standard, memory load.
+ */
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ */
+#define imv_read(name)			_imv_read(name)
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i)		(name##__imv = (i))
+
+static inline void core_imv_update(void) { }
+static inline void module_imv_update(void) { }
+
+#endif
+
+#define DECLARE_IMV(type, name) extern __typeof__(type) name##__imv
+#define DEFINE_IMV(type, name)  __typeof__(type) name##__imv
+
+#define EXPORT_IMV_SYMBOL(name) EXPORT_SYMBOL(name##__imv)
+#define EXPORT_IMV_SYMBOL_GPL(name) EXPORT_SYMBOL_GPL(name##__imv)
+
+/**
+ * _imv_read - Read immediate value with standard memory load.
+ * @name: immediate value name
+ *
+ * Force a data read of the immediate value instead of the immediate value
+ * based mechanism. Useful for __init and __exit section data read.
+ */
+#define _imv_read(name)		(name##__imv)
+
+#endif
Index: linux-2.6-sched-devel/include/linux/module.h
===================================================================
--- linux-2.6-sched-devel.orig/include/linux/module.h	2008-04-16 11:07:24.000000000 -0400
+++ linux-2.6-sched-devel/include/linux/module.h	2008-04-16 11:14:29.000000000 -0400
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -355,6 +356,10 @@ struct module
 	/* The command line arguments (may be mangled).  People like
 	   keeping pointers to this stuff */
 	char *args;
+#ifdef CONFIG_IMMEDIATE
+	const struct __imv *immediate;
+	unsigned int num_immediate;
+#endif
 #ifdef CONFIG_MARKERS
 	struct marker *markers;
 	unsigned int num_markers;
@@ -467,6 +472,9 @@ extern void print_modules(void);
 
 extern void module_update_markers(void);
 
+extern void _module_imv_update(void);
+extern void module_imv_update(void);
+
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -571,6 +579,14 @@ static inline void module_update_markers
 {
 }
 
+static inline void _module_imv_update(void)
+{
+}
+
+static inline void module_imv_update(void)
+{
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;
Index: linux-2.6-sched-devel/kernel/module.c
===================================================================
--- linux-2.6-sched-devel.orig/kernel/module.c	2008-04-16 11:10:44.000000000 -0400
+++ linux-2.6-sched-devel/kernel/module.c	2008-04-16 11:15:32.000000000 -0400
@@ -33,6 +33,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1716,6 +1717,7 @@ static struct module *load_module(void _
 	unsigned int unusedcrcindex;
 	unsigned int unusedgplindex;
 	unsigned int unusedgplcrcindex;
+	unsigned int immediateindex;
 	unsigned int markersindex;
 	unsigned int markersstringsindex;
 	struct module *mod;
@@ -1814,6 +1816,7 @@ static struct module *load_module(void _
 #ifdef ARCH_UNWIND_SECTION_NAME
 	unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME);
 #endif
+	immediateindex = find_sec(hdr, sechdrs, secstrings, "__imv");
 
 	/* Don't keep modinfo section */
 	sechdrs[infoindex].sh_flags &= ~(unsigned long)SHF_ALLOC;
@@ -1972,6 +1975,11 @@ static struct module *load_module(void _
 	mod->gpl_future_syms = (void *)sechdrs[gplfutureindex].sh_addr;
 	if (gplfuturecrcindex)
 		mod->gpl_future_crcs = (void *)sechdrs[gplfuturecrcindex].sh_addr;
+#ifdef CONFIG_IMMEDIATE
+	mod->immediate = (void *)sechdrs[immediateindex].sh_addr;
+	mod->num_immediate =
+		sechdrs[immediateindex].sh_size / sizeof(*mod->immediate);
+#endif
 
 	mod->unused_syms = (void *)sechdrs[unusedindex].sh_addr;
 	if (unusedcrcindex)
@@ -2039,11 +2047,16 @@ static struct module *load_module(void _
 
 	add_kallsyms(mod, sechdrs, symindex, strindex, secstrings);
 
+	if (!mod->taints) {
 #ifdef CONFIG_MARKERS
-	if (!mod->taints)
 		marker_update_probe_range(mod->markers,
 			mod->markers + mod->num_markers);
 #endif
+#ifdef CONFIG_IMMEDIATE
+		imv_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+#endif
+	}
 	err = module_finalize(hdr, sechdrs, mod);
 	if (err < 0)
 		goto cleanup;
@@ -2589,3 +2602,38 @@ void module_update_markers(void)
 	mutex_unlock(&module_mutex);
 }
 #endif
+
+#ifdef CONFIG_IMMEDIATE
+/**
+ * _module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Module_mutex must be held by the caller.
+ */
+void _module_imv_update(void)
+{
+	struct module *mod;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		imv_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+	}
+}
+EXPORT_SYMBOL_GPL(_module_imv_update);
+
+/**
+ * module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Takes module_mutex.
+ */
+void module_imv_update(void)
+{
+	mutex_lock(&module_mutex);
+	_module_imv_update();
+	mutex_unlock(&module_mutex);
+}
+EXPORT_SYMBOL_GPL(module_imv_update);
+#endif
Index: linux-2.6-sched-devel/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-sched-devel/kernel/immediate.c	2008-04-16 11:14:29.000000000 -0400
@@ -0,0 +1,149 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+/*
+ * Kernel ready to execute the SMP update that may depend on trap and ipi.
+ */
+static int imv_early_boot_complete;
+static int wrote_text;
+
+extern const struct __imv __start___imv[];
+extern const struct __imv __stop___imv[];
+
+static int stop_machine_imv_update(void *imv_ptr)
+{
+	struct __imv *imv = imv_ptr;
+
+	if (!wrote_text) {
+		text_poke((void *)imv->imv, (void *)imv->var, imv->size);
+		wrote_text = 1;
+		smp_wmb(); /* make sure other cpus see that this has run */
+	} else
+		sync_core();
+
+	flush_icache_range(imv->imv, imv->imv + imv->size);
+
+	return 0;
+}
+
+/*
+ * imv_mutex nests inside module_mutex. imv_mutex protects builtin
+ * immediates and module immediates.
+ */
+static DEFINE_MUTEX(imv_mutex);
+
+
+/**
+ * apply_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ * It makes sure all CPUs are not executing the modified code by having them
+ * busy looping with interrupts disabled.
+ * It does _not_ protect against NMI and MCE (could be a problem with Intel's
+ * errata if we use immediate values in their code path).
+ */
+static int apply_imv_update(const struct __imv *imv)
+{
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (imv->size) {
+	case 1:	if (*(uint8_t *)imv->imv
+				== *(uint8_t *)imv->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)imv->imv
+				== *(uint16_t *)imv->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)imv->imv
+				== *(uint32_t *)imv->var)
+			return 0;
+		break;
+	case 8:	if (*(uint64_t *)imv->imv
+				== *(uint64_t *)imv->var)
+			return 0;
+		break;
+	default:return -EINVAL;
+	}
+
+	if (imv_early_boot_complete) {
+		kernel_text_lock();
+		wrote_text = 0;
+		stop_machine_run(stop_machine_imv_update, (void *)imv,
+					ALL_CPUS);
+		kernel_text_unlock();
+	} else
+		text_poke_early((void *)imv->imv, (void *)imv->var,
+				imv->size);
+	return 0;
+}
+
+/**
+ * imv_update_range - Update immediate values in a range
+ * @begin: pointer to the beginning of the range
+ * @end: pointer to the end of the range
+ *
+ * Updates a range of immediates.
+ */
+void imv_update_range(const struct __imv *begin,
+		const struct __imv *end)
+{
+	const struct __imv *iter;
+	int ret;
+	for (iter = begin; iter < end; iter++) {
+		mutex_lock(&imv_mutex);
+		ret = apply_imv_update(iter);
+		if (imv_early_boot_complete && ret)
+			printk(KERN_WARNING
+				"Invalid immediate value. "
+				"Variable at %p, "
+				"instruction at %p, size %hu\n",
+				(void *)iter->var,
+				(void *)iter->imv, iter->size);
+		mutex_unlock(&imv_mutex);
+	}
+}
+EXPORT_SYMBOL_GPL(imv_update_range);
+
+/**
+ * core_imv_update - update all immediate values in the core kernel
+ *
+ * Iterate on the kernel core to update the immediate values.
+ */
+void core_imv_update(void)
+{
+	/* Core kernel imvs */
+	imv_update_range(__start___imv, __stop___imv);
+}
+EXPORT_SYMBOL_GPL(core_imv_update);
+
+void __init imv_init_complete(void)
+{
+	imv_early_boot_complete = 1;
+}
Index: linux-2.6-sched-devel/init/main.c
===================================================================
--- linux-2.6-sched-devel.orig/init/main.c	2008-04-16 11:07:24.000000000 -0400
+++ linux-2.6-sched-devel/init/main.c	2008-04-16 11:15:51.000000000 -0400
@@ -60,6 +60,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -103,6 +104,11 @@ static inline void mark_rodata_ro(void)
 #ifdef CONFIG_TC
 extern void tc_init(void);
 #endif
+#ifdef CONFIG_IMMEDIATE
+extern void imv_init_complete(void);
+#else
+static inline void imv_init_complete(void) { }
+#endif
 
 enum system_states system_state;
 EXPORT_SYMBOL(system_state);
@@ -547,6 +553,7 @@ asmlinkage void __init start_kernel(void
 	boot_init_stack_canary();
 
 	cgroup_init_early();
+	core_imv_update();
 
 	local_irq_disable();
 	early_boot_irqs_off();
@@ -671,6 +678,7 @@ asmlinkage void __init start_kernel(void
 	cpuset_init();
 	taskstats_init_early();
 	delayacct_init();
+	imv_init_complete();
 
 	check_bugs();
 
Index: linux-2.6-sched-devel/kernel/Makefile
===================================================================
--- linux-2.6-sched-devel.orig/kernel/Makefile	2008-04-16 11:07:24.000000000 -0400
+++ linux-2.6-sched-devel/kernel/Makefile	2008-04-16 11:14:29.000000000 -0400
@@ -75,6 +75,7 @@ obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
 obj-$(CONFIG_MARKERS) += marker.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_FTRACE) += trace/
Index: linux-2.6-sched-devel/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-sched-devel.orig/include/asm-generic/vmlinux.lds.h	2008-04-16 11:07:23.000000000 -0400
+++ linux-2.6-sched-devel/include/asm-generic/vmlinux.lds.h	2008-04-16 11:14:29.000000000 -0400
@@ -61,6 +61,9 @@
 		*(.rodata) *(.rodata.*)					\
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
+		VMLINUX_SYMBOL(__start___imv) = .;			\
+		*(__imv)		/* Immediate values: pointers */ \
+		VMLINUX_SYMBOL(__stop___imv) = .;			\
 	}								\
 									\
 	.rodata1 : AT(ADDR(.rodata1) - LOAD_OFFSET) {			\
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68