Message-ID: <6110982a-bb68-c88b-6fd1-24d2c49d9fd7@suse.com>
Date: Mon, 17 Apr 2023 13:18:14 +0200
From: Petr Pavlu <petr.pavlu@...e.com>
To: Luis Chamberlain <mcgrof@...nel.org>
Cc: christophe.leroy@...roup.eu, tglx@...utronix.de,
peterz@...radead.org, song@...nel.org, rppt@...nel.org,
dave@...olabs.net, willy@...radead.org, vbabka@...e.cz,
mhocko@...e.com, dave.hansen@...ux.intel.com,
colin.i.king@...il.com, jim.cromie@...il.com,
catalin.marinas@....com, jbaron@...mai.com,
rick.p.edgecombe@...el.com, david@...hat.com,
patches@...ts.linux.dev, linux-modules@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, pmladek@...e.com,
prarit@...hat.com, torvalds@...ux-foundation.org,
gregkh@...uxfoundation.org, rafael@...nel.org
Subject: Re: [PATCH v3 3/4] module: add debug stats to help identify memory
pressure
On 4/14/23 07:08, Luis Chamberlain wrote:
> Loading modules with finit_module() can end up using vmalloc(), vmap()
> and vmalloc() again, for a total of up to 3 separate allocations in the
> worst case for a single module. We always kernel_read*() the module,
> that's a vmalloc(). Then vmap() is used for the module decompression,
> and if so the last read buffer is freed as we use the now decompressed
> module buffer to stuff data into our copy module. The last allocation is
> specific to each architectures but pretty much that's generally a series
> of vmalloc() calls or a variation of vmalloc to handle ELF sections with
> special permissions.
>
> Evaluation with new stress-ng module support [1] with just 100 ops
> is proving that you can end up using GiBs of data easily even with all
> care we have in the kernel and userspace today in trying to not load modules
> which are already loaded. 100 ops seems to resemble the sort of pressure a
> system with about 400 CPUs can create on module loading. Although issues
> relating to duplicate module requests due to each CPU inucurring a new
> module reuest is silly and some of these are being fixed, we currently lack
> proper tooling to help diagnose easily what happened, when it happened
> and who likely is to blame -- userspace or kernel module autoloading.
>
> Provide an initial set of stats which use debugfs to let us easily scrape
> post-boot information about failed loads. This sort of information can
> be used on production worklaods to try to optimize *avoiding* redundant
> memory pressure using finit_module().
This looks useful, thanks for it. Some comments below.
>
> There's a few examples that can be provided:
>
> A 255 vCPU system without the next patch in this series applied:
>
> Startup finished in 19.143s (kernel) + 7.078s (userspace) = 26.221s
> graphical.target reached after 6.988s in userspace
>
> And 13.58 GiB of virtual memory space lost due to failed module loading:
>
> root@big ~ # cat /sys/kernel/debug/modules/stats
> Mods ever loaded 67
> Mods failed on kread 0
> Mods failed on decompress 0
> Mods failed on becoming 0
> Mods failed on load 1411
> Total module size 11464704
> Total mod text size 4194304
> Failed kread bytes 0
> Failed decompress bytes 0
> Failed becoming bytes 0
> Failed kmod bytes 14588526272
> Virtual mem wasted bytes 14588526272
> Average mod size 171115
> Average mod text size 62602
> Average fail load bytes 10339140
> Duplicate failed modules:
> module-name How-many-times Reason
> kvm_intel 249 Load
> kvm 249 Load
> irqbypass 8 Load
> crct10dif_pclmul 128 Load
> ghash_clmulni_intel 27 Load
> sha512_ssse3 50 Load
> sha512_generic 200 Load
> aesni_intel 249 Load
> crypto_simd 41 Load
> cryptd 131 Load
> evdev 2 Load
> serio_raw 1 Load
> virtio_pci 3 Load
> nvme 3 Load
> nvme_core 3 Load
> virtio_pci_legacy_dev 3 Load
> virtio_pci_modern_dev 3 Load
> t10_pi 3 Load
> virtio 3 Load
> crc32_pclmul 6 Load
> crc64_rocksoft 3 Load
> crc32c_intel 40 Load
> virtio_ring 3 Load
> crc64 3 Load
>
> The following screen shot, of a simple 8vcpu 8 GiB KVM guest with the
> next patch in this series applied, shows 226.53 MiB are wasted in virtual
> memory allocations which due to duplicate module requests during boot.
> It also shows an average module memory size of 167.10 KiB and an an
> average module .text + .init.text size of 61.13 KiB. The end shows all
> modules which were detected as duplicate requests and whether or not
> they failed early after just the first kernel_read*() call or late after
> we've already allocated the private space for the module in
> layout_and_allocate(). A system with module decompression would reveal
> more wasted virtual memory space.
>
> We should put effort now into identifying the source of these duplicate
> module requests and trimming these down as much possible. Larger systems
> will obviously show much more wasted virtual memory allocations.
>
> root@...d ~ # cat /sys/kernel/debug/modules/stats
> Mods ever loaded 67
> Mods failed on kread 0
> Mods failed on decompress 0
> Mods failed on becoming 83
> Mods failed on load 16
> Total module size 11464704
> Total mod text size 4194304
> Failed kread bytes 0
> Failed decompress bytes 0
> Failed becoming bytes 228959096
> Failed kmod bytes 8578080
> Virtual mem wasted bytes 237537176
> Average mod size 171115
> Average mod text size 62602
> Avg fail becoming bytes 2758544
> Average fail load bytes 536130
> Duplicate failed modules:
> module-name How-many-times Reason
> kvm_intel 7 Becoming
> kvm 7 Becoming
> irqbypass 6 Becoming & Load
> crct10dif_pclmul 7 Becoming & Load
> ghash_clmulni_intel 7 Becoming & Load
> sha512_ssse3 6 Becoming & Load
> sha512_generic 7 Becoming & Load
> aesni_intel 7 Becoming
> crypto_simd 7 Becoming & Load
> cryptd 3 Becoming & Load
> evdev 1 Becoming
> serio_raw 1 Becoming
> nvme 3 Becoming
> nvme_core 3 Becoming
> t10_pi 3 Becoming
> virtio_pci 3 Becoming
> crc32_pclmul 6 Becoming & Load
> crc64_rocksoft 3 Becoming
> crc32c_intel 3 Becoming
> virtio_pci_modern_dev 2 Becoming
> virtio_pci_legacy_dev 1 Becoming
> crc64 2 Becoming
> virtio 2 Becoming
> virtio_ring 2 Becoming
>
> [0] https://github.com/ColinIanKing/stress-ng.git
> [1] echo 0 > /proc/sys/vm/oom_dump_tasks
> ./stress-ng --module 100 --module-name xfs
>
> Signed-off-by: Luis Chamberlain <mcgrof@...nel.org>
> ---
> Documentation/core-api/kernel-api.rst | 22 +-
> kernel/module/Kconfig | 37 +++
> kernel/module/Makefile | 1 +
> kernel/module/decompress.c | 4 +
> kernel/module/internal.h | 74 +++++
> kernel/module/main.c | 65 +++-
> kernel/module/stats.c | 432 ++++++++++++++++++++++++++
> kernel/module/tracking.c | 7 +-
> 8 files changed, 630 insertions(+), 12 deletions(-)
> create mode 100644 kernel/module/stats.c
>
> diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
> index e27728596008..9b3f3e5f5a95 100644
> --- a/Documentation/core-api/kernel-api.rst
> +++ b/Documentation/core-api/kernel-api.rst
> @@ -220,12 +220,30 @@ relay interface
> Module Support
> ==============
>
> -Module Loading
> ---------------
> +Kernel module auto-loading
> +--------------------------
>
> .. kernel-doc:: kernel/module/kmod.c
> :export:
>
> +Module debugging
> +----------------
> +
> +.. kernel-doc:: kernel/module/stats.c
> + :doc: module debugging statistics overview
> +
> +dup_failed_modules - tracks duplicate failed modules
> +****************************************************
> +
> +.. kernel-doc:: kernel/module/stats.c
> + :doc: dup_failed_modules - tracks duplicate failed modules
> +
> +module statistics debugfs counters
> +**********************************
> +
> +.. kernel-doc:: kernel/module/stats.c
> + :doc: module statistics debugfs counters
> +
> Inter Module support
> --------------------
>
> diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
> index 424b3bc58f3f..ca277b945a67 100644
> --- a/kernel/module/Kconfig
> +++ b/kernel/module/Kconfig
> @@ -22,6 +22,43 @@ menuconfig MODULES
>
> if MODULES
>
> +config MODULE_DEBUG
> + bool "Module debugging"
> + depends on DEBUG_FS
> + help
> + Allows you to enable / disable features which can help you debug
> + modules. You don't need these options in produciton systems. You can
> + and probably should enable this prior to making your kernel
> + produciton ready though.
2x typo: produciton -> production.
The last sentence could be misinterpreted to mean that you should enable this
option to make your kernel production ready. I'm not sure; maybe I would just
drop that sentence.
Note that there are plenty of other typos in the added comments and
documentation. Please review them with a spell checker.
> +
> +if MODULE_DEBUG
> +
> +config MODULE_STATS
> + bool "Module statistics"
> + depends on DEBUG_FS
> + help
> + This option allows you to maintain a record of module statistics.
> + For example each all modules size, average size, text size, and
> + failed modules and the size for each of those. For failed
This sentence doesn't quite make sense. I guess it should say something like:
For example, size of all modules, average size, text size, a list of failed
modules and the size for each of those.
> + modules we keep track of module which failed due to either the
> + existing module taking too long to load or that module already
> + was loaded.
> +
> + You should enable this if you are debugging production loads
> + and want to see if userspace or the kernel is doing stupid things
> + with loading modules when it shouldn't or if you want to help
> + optimize userspace / kernel space module autoloading schemes.
> + You might want to do this because failed modules tend to use
> + use up significan amount of memory, and so you'd be doing everyone
The word 'use' is duplicated here.
> + a favor in avoiding these failure proactively.
> +
> + This functionality is also useful for those experimenting with
> + module .text ELF section optimization.
> +
> + If unsure, say N.
> +
> +endif # MODULE_DEBUG
> +
> config MODULE_FORCE_LOAD
> bool "Forced module loading"
> default n
> diff --git a/kernel/module/Makefile b/kernel/module/Makefile
> index 5b1d26b53b8d..52340bce497e 100644
> --- a/kernel/module/Makefile
> +++ b/kernel/module/Makefile
> @@ -21,3 +21,4 @@ obj-$(CONFIG_SYSFS) += sysfs.o
> obj-$(CONFIG_KGDB_KDB) += kdb.o
> obj-$(CONFIG_MODVERSIONS) += version.o
> obj-$(CONFIG_MODULE_UNLOAD_TAINT_TRACKING) += tracking.o
> +obj-$(CONFIG_MODULE_STATS) += stats.o
> diff --git a/kernel/module/decompress.c b/kernel/module/decompress.c
> index 7ddc87bee274..e97232b125eb 100644
> --- a/kernel/module/decompress.c
> +++ b/kernel/module/decompress.c
> @@ -297,6 +297,10 @@ int module_decompress(struct load_info *info, const void *buf, size_t size)
> ssize_t data_size;
> int error;
>
> +#if defined(CONFIG_MODULE_STATS)
> + info->compressed_len = size;
> +#endif
> +
> /*
> * Start with number of pages twice as big as needed for
> * compressed data.
> diff --git a/kernel/module/internal.h b/kernel/module/internal.h
> index 6ae29bb8836f..9d97a59a9127 100644
> --- a/kernel/module/internal.h
> +++ b/kernel/module/internal.h
> @@ -59,6 +59,9 @@ struct load_info {
> unsigned long mod_kallsyms_init_off;
> #endif
> #ifdef CONFIG_MODULE_DECOMPRESS
> +#ifdef CONFIG_MODULE_STATS
> + unsigned long compressed_len;
> +#endif
> struct page **pages;
> unsigned int max_pages;
> unsigned int used_pages;
> @@ -143,6 +146,77 @@ static inline bool set_livepatch_module(struct module *mod)
> #endif
> }
>
> +/**
> + * enum fail_dup_mod_reason - state at which a duplicate module was detected
> + *
> + * @FAIL_DUP_MOD_BECOMING: the module is read properly, passes all checks but
> + * we've determined that another module with the same name is already loaded
> + * or being processed on our &modules list. This happens on early_mod_check()
> + * right before layout_and_allocate(). The kernel would have already
> + * vmalloc()'d space for the entire module through finit_module(). If
> + * decompression was used two vmap() spaces were used. These failures can
> + * happen when userspace has not seen the module present on the kernel and
> + * tries to load the module multiple times at same time.
> + * @FAIL_DUP_MOD_LOAD: the module has been read properly, passes all validation
> + * checks and the kernel determines that the module was unique and because
> + * of this allocated yet another private kernel copy of the module space in
> + * layout_and_allocate() but after this determined in add_unformed_module()
> + * that another module with the same name is already loaded or being processed.
> + * These failures should be mitigated as much as possible and are indicative
> + * of really fast races in loading modules. Without module decompression
> + * they waste twice as much vmap space. With module decompression three
> + * times the module's size vmap space is wasted.
> + */
> +enum fail_dup_mod_reason {
> + FAIL_DUP_MOD_BECOMING = 0,
> + FAIL_DUP_MOD_LOAD,
> +};
> +
> +#ifdef CONFIG_MODULE_STATS
> +
> +#define mod_stat_add_long(count, var) atomic_long_add(count, var)
> +#define mod_stat_inc(name) atomic_inc(name)
> +
> +extern atomic_long_t total_mod_size;
> +extern atomic_long_t total_text_size;
> +extern atomic_long_t invalid_kread_bytes;
> +extern atomic_long_t invalid_decompress_bytes;
> +
> +extern atomic_t modcount;
> +extern atomic_t failed_kreads;
> +extern atomic_t failed_decompress;
> +struct mod_fail_load {
> + struct list_head list;
> + char name[MODULE_NAME_LEN];
> + atomic_long_t count;
> + unsigned long dup_fail_mask;
> +};
> +
> +int try_add_failed_module(const char *name, size_t len, enum fail_dup_mod_reason reason);
> +void mod_stat_bump_invalid(struct load_info *info, int flags);
> +void mod_stat_bump_becoming(struct load_info *info, int flags);
> +
> +#else
> +
> +#define mod_stat_add_long(name, var)
> +#define mod_stat_inc(name)
> +
> +static inline int try_add_failed_module(const char *name, size_t len,
> + enum fail_dup_mod_reason reason)
> +{
> + return 0;
> +}
> +
> +static inline void mod_stat_bump_invalid(struct load_info *info, int flags)
> +{
> +}
> +
> +static inline void mod_stat_bump_becoming(struct load_info *info, int flags)
> +{
> +}
> +
> +#endif /* CONFIG_MODULE_STATS */
> +
> #ifdef CONFIG_MODULE_UNLOAD_TAINT_TRACKING
> struct mod_unload_taint {
> struct list_head list;
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 75b23257128d..5642d77657a0 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -56,6 +56,7 @@
> #include <linux/dynamic_debug.h>
> #include <linux/audit.h>
> #include <linux/cfi.h>
> +#include <linux/debugfs.h>
> #include <uapi/linux/module.h>
> #include "internal.h"
>
> @@ -87,6 +88,8 @@ struct symsearch {
> enum mod_license license;
> };
>
> +struct dentry *mod_debugfs_root;
> +
> /*
> * Bounds of module memory, for speeding up __module_address.
> * Protected by module_mutex.
> @@ -2500,6 +2503,18 @@ static noinline int do_init_module(struct module *mod)
> {
> int ret = 0;
> struct mod_initfree *freeinit;
> +#if defined(CONFIG_MODULE_STATS)
> + unsigned int text_size = 0, total_size = 0;
> +
> + for_each_mod_mem_type(type) {
> + const struct module_memory *mod_mem = &mod->mem[type];
> + if (mod_mem->size) {
> + total_size += mod_mem->size;
> + if (type == MOD_TEXT || type == MOD_INIT_TEXT)
> + text_size += mod->mem[type].size;
'text_size += mod_mem->size;' would be simpler.
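That is, with that change the whole hunk could read (untested):

	for_each_mod_mem_type(type) {
		const struct module_memory *mod_mem = &mod->mem[type];

		if (mod_mem->size) {
			total_size += mod_mem->size;
			if (type == MOD_TEXT || type == MOD_INIT_TEXT)
				text_size += mod_mem->size;
		}
	}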
> + }
> + }
> +#endif
>
> freeinit = kmalloc(sizeof(*freeinit), GFP_KERNEL);
> if (!freeinit) {
> @@ -2561,6 +2576,7 @@ static noinline int do_init_module(struct module *mod)
> mod->mem[type].base = NULL;
> mod->mem[type].size = 0;
> }
> +
> #ifdef CONFIG_DEBUG_INFO_BTF_MODULES
> /* .BTF is not SHF_ALLOC and will get removed, so sanitize pointer */
> mod->btf_data = NULL;
> @@ -2584,6 +2600,11 @@ static noinline int do_init_module(struct module *mod)
> mutex_unlock(&module_mutex);
> wake_up_all(&module_wq);
>
> + mod_stat_add_long(text_size, &total_text_size);
> + mod_stat_add_long(total_size, &total_mod_size);
> +
> + mod_stat_inc(&modcount);
> +
> return 0;
>
> fail_free_freeinit:
> @@ -2599,6 +2620,7 @@ static noinline int do_init_module(struct module *mod)
> ftrace_release_mod(mod);
> free_module(mod);
> wake_up_all(&module_wq);
> +
> return ret;
> }
>
> @@ -2632,7 +2654,8 @@ static bool finished_loading(const char *name)
> }
>
> /* Must be called with module_mutex held */
> -static int module_patient_check_exists(const char *name)
> +static int module_patient_check_exists(const char *name,
> + enum fail_dup_mod_reason reason)
> {
> struct module *old;
> int err = 0;
> @@ -2655,6 +2678,9 @@ static int module_patient_check_exists(const char *name)
> old = find_module_all(name, strlen(name), true);
> }
>
> + if (try_add_failed_module(name, strlen(name), reason))
> + pr_warn("Could not add fail-tracking for module: %s\n", name);
> +
> /*
> * We are here only when the same module was being loaded. Do
> * not try to load it again right now. It prevents long delays
> @@ -2679,7 +2705,7 @@ static int add_unformed_module(struct module *mod)
> mod->state = MODULE_STATE_UNFORMED;
>
> mutex_lock(&module_mutex);
> - err = module_patient_check_exists(mod->name);
> + err = module_patient_check_exists(mod->name, FAIL_DUP_MOD_LOAD);
> if (err)
> goto out;
>
> @@ -2800,6 +2826,7 @@ static int load_module(struct load_info *info, const char __user *uargs,
> int flags)
> {
> struct module *mod;
> + bool module_allocated = false;
> long err = 0;
> char *after_dashes;
>
> @@ -2839,6 +2866,8 @@ static int load_module(struct load_info *info, const char __user *uargs,
> goto free_copy;
> }
>
> + module_allocated = true;
> +
> audit_log_kern_module(mod->name);
>
> /* Reserve our place in the list. */
> @@ -2983,6 +3012,7 @@ static int load_module(struct load_info *info, const char __user *uargs,
> synchronize_rcu();
> mutex_unlock(&module_mutex);
> free_module:
> + mod_stat_bump_invalid(info, flags);
> /* Free lock-classes; relies on the preceding sync_rcu() */
> for_class_mod_mem_type(type, core_data) {
> lockdep_free_key_range(mod->mem[type].base,
> @@ -2991,6 +3021,13 @@ static int load_module(struct load_info *info, const char __user *uargs,
>
> module_deallocate(mod, info);
> free_copy:
> + /*
> + * The info->len is always set. We distinguish between
> + * failures once the proper module was allocated and
> + * before that.
> + */
> + if (!module_allocated)
> + mod_stat_bump_becoming(info, flags);
> free_copy(info, flags);
> return err;
> }
> @@ -3009,8 +3046,11 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
> umod, len, uargs);
>
> err = copy_module_from_user(umod, len, &info);
> - if (err)
> + if (err) {
> + mod_stat_inc(&failed_kreads);
> + mod_stat_add_long(len, &invalid_kread_bytes);
> return err;
> + }
>
> return load_module(&info, uargs, 0);
> }
> @@ -3035,14 +3075,20 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
>
> len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
> READING_MODULE);
> - if (len < 0)
> + if (len < 0) {
> + mod_stat_inc(&failed_kreads);
> + mod_stat_add_long(len, &invalid_kread_bytes);
> return len;
> + }
>
> if (flags & MODULE_INIT_COMPRESSED_FILE) {
> err = module_decompress(&info, buf, len);
> vfree(buf); /* compressed data is no longer needed */
> - if (err)
> + if (err) {
> + mod_stat_inc(&failed_decompress);
> + mod_stat_add_long(len, &invalid_decompress_bytes);
> return err;
> + }
> } else {
> info.hdr = buf;
> info.len = len;
> @@ -3216,3 +3262,12 @@ void print_modules(void)
> last_unloaded_module.taints);
> pr_cont("\n");
> }
> +
> +#ifdef CONFIG_MODULE_DEBUG
> +static int module_debugfs_init(void)
> +{
> + mod_debugfs_root = debugfs_create_dir("modules", NULL);
> + return 0;
> +}
> +module_init(module_debugfs_init);
> +#endif
> diff --git a/kernel/module/stats.c b/kernel/module/stats.c
> new file mode 100644
> index 000000000000..d4b5b2b9e6ad
> --- /dev/null
> +++ b/kernel/module/stats.c
> @@ -0,0 +1,432 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Debugging module statistics.
> + *
> + * Copyright (C) 2023 Luis Chamberlain <mcgrof@...nel.org>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/string.h>
> +#include <linux/printk.h>
> +#include <linux/slab.h>
> +#include <linux/list.h>
> +#include <linux/debugfs.h>
> +#include <linux/rculist.h>
> +#include <linux/math.h>
> +
> +#include "internal.h"
> +
> +/**
> + * DOC: module debugging statistics overview
> + *
> + * Enabling CONFIG_MODULE_STATS enables module debugging statistics which
> + * are useful to monitor and root cause memory pressure issues with module
> + * loading. These statistics are useful to allow us to improve production
> + * workloads.
> + *
> + * The current module debugging statistics supported help keep track of module
> + * loading failures to enable improvements either for kernel module
> + * auto-loading usage (request_module()) or interactions with userspace.
> + * Statistics are provided to track of all possible failures in the
> + * finit_module() path and memory wasted in this process space. Each of the
> + * failure counters are associated to a type of module loading failure which
> + * is known to incur a certain amount of memory allocation loss. In the worst
> + * case loading a module will fail after a 3 step memory allocation process:
> + *
> + * a) memory allocated with kernel_read_file_from_fd()
> + * b) module decompression processes the file read from
> + * kernel_read_file_from_fd(), and vmap() is used to map
> + * the decompressed module to a new local buffer which represents
> + * a copy of the decompressed module passed from userspace. The buffer
> + * from kernel_read_file_from_fd() is freed right away.
> + * c) layout_and_allocate() allocates space for the final resting
> + * place where we would keep the module if it were to be processed
> + * successfully.
> + *
> + * If a failure occurs after these three different allocations only one
> + * counters will be incremetned with the summation of the lost bytes incurred
> + * during this failure. Likewise, if a module loading failed only after step b)
> + * a separate counter is used and incremented for the bytes lost during both
> + * of those allocations.
> + *
> + * Virtual memory space can be limited, for example on x86 virtual memory size
> + * defaults to 128 MiB. We should strive to limit and avoid wasting virtual
> + * memory allocations when possible. These module dubugging statistics help
> + * to evaluate how much memory is being wasted on bootup due to module loading
> + * failures.
> + *
> + * All counters are designed to be incremental. Atomic counters are used so to
> + * remain simple and avoid delays and deadlocks.
> + */
> +
> +extern struct dentry *mod_debugfs_root;
Files kernel/module/stats.c and kernel/module/tracking.c both add this extern
declaration. Can it be moved to kernel/module/internal.h?
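I.e. kernel/module/internal.h could carry the single declaration that both
files then pick up:

	extern struct dentry *mod_debugfs_root;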
> +
> +/**
> + * DOC: dup_failed_modules - tracks duplicate failed modules
> + *
> + * Linked list of modules which failed to be loaded because an already existing
> + * module with the same name was already being processed or already loaded.
> + * The finit_module() system call incurs heavy virtual memory allocations. In
> + * the worst case an finit_module() system call can end up allocating virtual
> + * memory 3 times:
> + *
> + * 1) kernel_read_file_from_fd() call uses vmalloc()
> + * 2) optional module decompression uses vmap()
> + * 3) layout_and allocate() can use vzalloc() or an arch specific variation of
> + * vmalloc to deal with ELF sections requiring special permissions
> + *
> + * In practice on a typical boot today most finit_module() calls fail due to
> + * the module with the same name already being loaded or about to be processed.
> + * All virtual memory allocated to these failed modules will be lost with
> + * no functional use.
> + *
> + * To help with this the dup_failed_modules allows us to track modules which
> + * failed to load due to the fact that a module already was loaded or being
> + * processed already. There are only two points at which we can fail such
> + * calls, we list them below along with the number of virtual memory allocation
> + * calls:
> + *
> + * a) FAIL_DUP_MOD_BECOMING: at the end of early_mod_check() before
> + * layout_and_allocate(). This does not yet happen.
> + * - with module decompression: 2 virtual memory allocation calls
> + * - without module decompression: 1 virtual memory allocation calls
> + * b) FAIL_DUP_MOD_LOAD: after layout_and_allocate() on add_unformed_module()
> + * - with module decompression 3 virtual memory allocation calls
> + * - without module decompression 2 virtual memory allocation calls
> + *
> + * We should strive to get this list to be as small as possible. If this list
> + * is not empty it is a reflection of possible work or optimizations possible
> + * either in-kernel or in userspace.
> + */
> +static LIST_HEAD(dup_failed_modules);
> +
> +/**
> + * DOC: module statistics debugfs counters
> + *
> + * The total amount of wasted virtual memory allocation space during module
> + * loading can be computed by adding the total from the summation:
> + *
> + * * @invalid_kread_bytes +
> + * @invalid_decompress_bytes +
> + * @invalid_becoming_bytes +
> + * @invalid_mod_bytes
> + *
> + * The following debugfs counters are available to inspect module loading
> + * failures:
> + *
> + * * total_mod_size: total bytes ever used by all modules we've dealt with on
> + * this system
> + * * total_text_size: total bytes of the .text and .init.text ELF section
> + * sizes we've dealt with on this system
> + * * invalid_kread_bytes: bytes wasted in failures which happen due to
> + * memory allocations with the initial kernel_read_file_from_fd().
> + * kernel_read_file_from_fd() uses vmalloc() and so these are wasted
> + * vmalloc() memory allocations. These should typically not happen unless
> + * your system is under memory pressure.
> + * * invalid_decompress_bytes: number of bytes wasted due to
> + * memory allocations in the module decompression path that use vmap().
> + * These typically should not happen unless your system is under memory
> + * presssure.
> + * * invalid_becoming_bytes: total number of bytes wasted due to
> + * allocations used to read the kernel module userspace wants us to read
> + * before we promote it to be processed to be added to our @modules linked
> + * list. These failures could in theory happen in if we had a check in between
> + * between a successful kernel_read_file_from_fd() call and right before
> + * we allocate the our private memory for the module which would be kept if
> + * the module is successfully loaded. The most common reason for this failure
> + * is when userspace is racing to load a module which it does not yet see
> + * loaded. The first module to succeed in add_unformed_module() will add a
> + * module to our &modules list and subsequent loads of modules with the
> + * same name will error out at the end of early_mod_check(). A check
> + * for module_patient_check_exists() at the end of early_mod_check() could be
> + * added to prevent duplicate allocations on layout_and_allocate() for
> + * modules already being processed. These duplicate failed modules are
> + * non-fatal, however they typically are indicative of userspace not seeing
> + * a module in userspace loaded yet and unecessarily trying to load a
> + * module before the kernel even has a chance to begin to process prior
> + * requests. Although duplicate failures can be non-fatal, we should try to
> + * reduce vmalloc() pressure proactively, so ideally after boot this will
> + * be close to as 0 as possible. If module decompression was used we also
> + * add to this counter the cost of the initial kernel_read_file_from_fd()
> + * of the compressed module. If module decompression was not used the
> + * value represents the total wasted allocations in kernel_read_file_from_fd()
> + * calls for these type of failures. These failures can occur because:
> + *
> + * * module_sig_check() - module signature checks
> + * * elf_validity_cache_copy() - some ELF validation issue
> + * * early_mod_check():
> + *
> + * * blacklisting
> + * * failed to rewrite section headers
> + * * version magic
> + * * live patch requirements didn't check out
> + * * the module was detected as being already present
> + *
> + * * invalid_mod_bytes: these are the total number of bytes lost due to
> + * failures after we did all the sanity checks of the module which userspace
> + * passed to us and after our first check that the module is unique. A
> + * module can still fail to load if we detect the module is loaded after we
> + * allocate space for it with layout_and_allocate(), we do this check right
> + * before processing the module as live and run its initialiation routines.
> + * Note that you have a failure of this type it also means the respective
> + * kernel_read_file_from_fd() memory space was also wasted, and so we
> + * increment this counter with twice the size of the module. Additionally
> + * if you used module decompression the size of the compressed module is
> + * also added to this counter.
> + *
> + * * modcount: how many modules we've loaded in our kernel life time
> + * * failed_kreads: how many modules failed due to failed kernel_read_file_from_fd()
> + * * failed_decompress: how many failed module decompression attempts we've had.
> + * These really should not happen unless your compression / decompression
> + * might be broken.
> + * * failed_becoming: how many modules failed after we kernel_read_file_from_fd()
> + * it and before we allocate memory for it with layout_and_allocate(). This
> + * counter is never incremented if you manage to validate the module and
> + * call layout_and_allocate() for it.
> + * * failed_load_modules: how many modules failed once we've allocated our
> + * private space for our module using layout_and_allocate(). These failures
> + * should hopefully mostly be dealt with already. Races in theory could
> + * still exist here, but it would just mean the kernel had started processing
> + * two threads concurrently up to early_mod_check() and then one just one
> + * thread won. These failures are good signs the kernel or userspace is
> + * doing something seriously stupid or that could be improved. We should
> + * strive to fix these, but it is perhaps not easy to fix them.
> + * A recent example are the modules requests incurred for frequency modules,
> + * a separate module request was being issued for each CPU on a system.
> + */
> +
> +atomic_long_t total_mod_size;
> +atomic_long_t total_text_size;
> +atomic_long_t invalid_kread_bytes;
> +atomic_long_t invalid_decompress_bytes;
> +static atomic_long_t invalid_becoming_bytes;
> +static atomic_long_t invalid_mod_bytes;
> +atomic_t modcount;
> +atomic_t failed_kreads;
> +atomic_t failed_decompress;
> +static atomic_t failed_becoming;
> +static atomic_t failed_load_modules;
> +
> +static const char *mod_fail_to_str(struct mod_fail_load *mod_fail)
> +{
> + if (test_bit(FAIL_DUP_MOD_BECOMING, &mod_fail->dup_fail_mask) &&
> + test_bit(FAIL_DUP_MOD_LOAD, &mod_fail->dup_fail_mask))
> + return "Becoming & Load";
> + if (test_bit(FAIL_DUP_MOD_BECOMING, &mod_fail->dup_fail_mask))
> + return "Becoming";
> + if (test_bit(FAIL_DUP_MOD_LOAD, &mod_fail->dup_fail_mask))
> + return "Load";
> + return "Bug-on-stats";
> +}
> +
> +void mod_stat_bump_invalid(struct load_info *info, int flags)
> +{
> + atomic_long_add(info->len * 2, &invalid_mod_bytes);
> + atomic_inc(&failed_load_modules);
> +#if defined(CONFIG_MODULE_DECOMPRESS)
> + if (flags & MODULE_INIT_COMPRESSED_FILE)
> + atomic_long_add(info->compressed_len, &invalid_mod_byte);
Variable invalid_mod_byte is not declared; this should be invalid_mod_bytes,
otherwise the build breaks when both CONFIG_MODULE_STATS and
CONFIG_MODULE_DECOMPRESS are enabled.
> +#endif
> +}
> +
> +void mod_stat_bump_becoming(struct load_info *info, int flags)
> +{
> + atomic_inc(&failed_becoming);
> + atomic_long_add(info->len, &invalid_becoming_bytes);
> +#if defined(CONFIG_MODULE_DECOMPRESS)
> + if (flags & MODULE_INIT_COMPRESSED_FILE)
> + atomic_long_add(info->compressed_len, &invalid_becoming_bytes);
> +#endif
> +}
> +
> +int try_add_failed_module(const char *name, size_t len, enum fail_dup_mod_reason reason)
Function try_add_failed_module() is only called from
module_patient_check_exists(), which always passes in a NUL-terminated string.
The len parameter could then be dropped and the comparison in
try_add_failed_module() could simply use strcmp(); see the sketch below the
quoted function.
> +{
> + struct mod_fail_load *mod_fail;
> +
> + list_for_each_entry_rcu(mod_fail, &dup_failed_modules, list,
> + lockdep_is_held(&module_mutex)) {
> + if (strlen(mod_fail->name) == len && !memcmp(mod_fail->name, name, len)) {
> + atomic_long_inc(&mod_fail->count);
> + __set_bit(reason, &mod_fail->dup_fail_mask);
> + goto out;
> + }
> + }
> +
> + mod_fail = kzalloc(sizeof(*mod_fail), GFP_KERNEL);
> + if (!mod_fail)
> + return -ENOMEM;
> + memcpy(mod_fail->name, name, len);
> + __set_bit(reason, &mod_fail->dup_fail_mask);
> + atomic_long_inc(&mod_fail->count);
> + list_add_rcu(&mod_fail->list, &dup_failed_modules);
> +out:
> + return 0;
> +}
Indentation in try_add_failed_module() uses spaces instead of tabs in a few
places.
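With the len parameter dropped and the indentation fixed, the function could
look roughly as follows. This is only an untested sketch; strscpy() replaces
the memcpy() since len is gone, and the prototype plus the !CONFIG_MODULE_STATS
stub in internal.h would need the same signature change:

	int try_add_failed_module(const char *name, enum fail_dup_mod_reason reason)
	{
		struct mod_fail_load *mod_fail;

		list_for_each_entry_rcu(mod_fail, &dup_failed_modules, list,
					lockdep_is_held(&module_mutex)) {
			if (!strcmp(mod_fail->name, name)) {
				atomic_long_inc(&mod_fail->count);
				__set_bit(reason, &mod_fail->dup_fail_mask);
				goto out;
			}
		}

		mod_fail = kzalloc(sizeof(*mod_fail), GFP_KERNEL);
		if (!mod_fail)
			return -ENOMEM;
		strscpy(mod_fail->name, name, sizeof(mod_fail->name));
		__set_bit(reason, &mod_fail->dup_fail_mask);
		atomic_long_inc(&mod_fail->count);
		list_add_rcu(&mod_fail->list, &dup_failed_modules);
	out:
		return 0;
	}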
> +
> +/*
> + * At 64 bytes per module and assuming a 1024 bytes preamble we can fit the
> + * 112 module prints within 8k.
> + *
> + * 1024 + (64*112) = 8k
> + */
> +#define MAX_PREAMBLE 1024
> +#define MAX_FAILED_MOD_PRINT 112
> +#define MAX_BYTES_PER_MOD 64
> +static ssize_t read_file_mod_stats(struct file *file, char __user *user_buf,
> + size_t count, loff_t *ppos)
> +{
> + struct mod_fail_load *mod_fail;
> + unsigned int len, size, count_failed = 0;
> + char *buf;
> + u32 live_mod_count, fkreads, fdecompress, fbecoming, floads;
> + u64 total_size, text_size, ikread_bytes, ibecoming_bytes, idecompress_bytes, imod_bytes,
> + total_virtual_lost;
> +
> + live_mod_count = atomic_read(&modcount);
> + fkreads = atomic_read(&failed_kreads);
> + fdecompress = atomic_read(&failed_decompress);
> + fbecoming = atomic_read(&failed_becoming);
> + floads = atomic_read(&failed_load_modules);
> +
> + total_size = atomic64_read(&total_mod_size);
> + text_size = atomic64_read(&total_text_size);
> + ikread_bytes = atomic64_read(&invalid_kread_bytes);
> + idecompress_bytes = atomic64_read(&invalid_decompress_bytes);
> + ibecoming_bytes = atomic64_read(&invalid_becoming_bytes);
> + imod_bytes = atomic64_read(&invalid_mod_bytes);
> +
> + total_virtual_lost = ikread_bytes + idecompress_bytes + ibecoming_bytes + imod_bytes;
> +
> + size = MAX_PREAMBLE + min((unsigned int)(floads + fbecoming) * MAX_BYTES_PER_MOD,
> + (unsigned int) MAX_FAILED_MOD_PRINT * MAX_BYTES_PER_MOD);
Using

	size = MAX_PREAMBLE + min((unsigned int)(floads + fbecoming),
				  (unsigned int)MAX_FAILED_MOD_PRINT) * MAX_BYTES_PER_MOD;

is a bit simpler and avoids any theoretical overflow of
'(floads + fbecoming) * MAX_BYTES_PER_MOD'.
> + buf = kzalloc(size, GFP_KERNEL);
> + if (buf == NULL)
> + return -ENOMEM;
> +
> + /* The beginning of our debug preamble */
> + len = scnprintf(buf + 0, size - len, "%25s\t%u\n", "Mods ever loaded", live_mod_count);
> +
> + len += scnprintf(buf + len, size - len, "%25s\t%u\n", "Mods failed on kread", fkreads);
> +
> + len += scnprintf(buf + len, size - len, "%25s\t%u\n", "Mods failed on decompress",
> + fdecompress);
> + len += scnprintf(buf + len, size - len, "%25s\t%u\n", "Mods failed on becoming", fbecoming);
> +
> + len += scnprintf(buf + len, size - len, "%25s\t%u\n", "Mods failed on load", floads);
> +
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Total module size", total_size);
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Total mod text size", text_size);
> +
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed kread bytes", ikread_bytes);
> +
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed decompress bytes",
> + idecompress_bytes);
> +
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed becoming bytes", ibecoming_bytes);
> +
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed kmod bytes", imod_bytes);
> +
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Virtual mem wasted bytes", total_virtual_lost);
> +
> + if (live_mod_count && total_size) {
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Average mod size",
> + DIV_ROUND_UP(total_size, live_mod_count));
> + }
> +
> + if (live_mod_count && text_size) {
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Average mod text size",
> + DIV_ROUND_UP(text_size, live_mod_count));
> + }
> +
> + /*
> + * We use WARN_ON_ONCE() for the counters to ensure we always have parity
> + * for keeping tabs on a type of failure with one type of byte counter.
> + * The counters for imod_bytes does not increase for fkreads failures
> + * for example, and so on.
> + */
> +
> + WARN_ON_ONCE(ikread_bytes && !fkreads);
> + if (fkreads && ikread_bytes) {
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Avg fail kread bytes",
> + DIV_ROUND_UP(ikread_bytes, fkreads));
> + }
> +
> + WARN_ON_ONCE(ibecoming_bytes && !fbecoming);
> + if (fbecoming && ibecoming_bytes) {
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Avg fail becoming bytes",
> + DIV_ROUND_UP(ibecoming_bytes, fbecoming));
> + }
> +
> + WARN_ON_ONCE(idecompress_bytes && !fdecompress);
> + if (fdecompress && idecompress_bytes) {
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Avg fail decomp bytes",
> + DIV_ROUND_UP(idecompress_bytes, fdecompress));
> + }
> +
> + WARN_ON_ONCE(imod_bytes && !floads);
> + if (floads && imod_bytes) {
> + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Average fail load bytes",
> + DIV_ROUND_UP(imod_bytes, floads));
> + }
> +
> + /* End of our debug preamble header. */
> +
> + /* Catch when we've gone beyond our expected preamble */
> + WARN_ON_ONCE(len >= MAX_PREAMBLE);
> +
> + if (list_empty(&dup_failed_modules))
> + goto out;
> +
> + len += scnprintf(buf + len, size - len, "Duplicate failed modules:\n");
> + len += scnprintf(buf + len, size - len, "%25s\t%15s\t%25s\n",
> + "module-name", "How-many-times", "Reason");
"module-name" -> "Module-name"
> + mutex_lock(&module_mutex);
> +
> +
> + list_for_each_entry_rcu(mod_fail, &dup_failed_modules, list) {
> + if (WARN_ON_ONCE(++count_failed >= MAX_FAILED_MOD_PRINT))
> + goto out_unlock;
> + len += scnprintf(buf + len, size - len, "%25s\t%15llu\t%25s\n", mod_fail->name,
> + atomic64_read(&mod_fail->count), mod_fail_to_str(mod_fail));
> + }
> +out_unlock:
> + mutex_unlock(&module_mutex);
> +out:
> + kfree(buf);
> + return simple_read_from_buffer(user_buf, count, ppos, buf, len);
> +}
> +#undef MAX_PREAMBLE
> +#undef MAX_FAILED_MOD_PRINT
> +#undef MAX_BYTES_PER_MOD
> +
> +static const struct file_operations fops_mod_stats = {
> + .read = read_file_mod_stats,
> + .open = simple_open,
> + .owner = THIS_MODULE,
> + .llseek = default_llseek,
> +};
> +
> +#define mod_debug_add_ulong(name) debugfs_create_ulong(#name, 0400, mod_debugfs_root, (unsigned long *) &name.counter)
> +#define mod_debug_add_atomic(name) debugfs_create_atomic_t(#name, 0400, mod_debugfs_root, &name)
> +static int __init module_stats_init(void)
> +{
> + mod_debug_add_ulong(total_mod_size);
> + mod_debug_add_ulong(total_text_size);
> + mod_debug_add_ulong(invalid_kread_bytes);
> + mod_debug_add_ulong(invalid_decompress_bytes);
> + mod_debug_add_ulong(invalid_becoming_bytes);
> + mod_debug_add_ulong(invalid_mod_bytes);
> +
> + mod_debug_add_atomic(modcount);
> + mod_debug_add_atomic(failed_kreads);
> + mod_debug_add_atomic(failed_decompress);
> + mod_debug_add_atomic(failed_becoming);
> + mod_debug_add_atomic(failed_load_modules);
> +
> + debugfs_create_file("stats", 0400, mod_debugfs_root, mod_debugfs_root, &fops_mod_stats);
> +
> + return 0;
> +}
> +#undef mod_debug_add_ulong
> +#undef mod_debug_add_atomic
> +module_init(module_stats_init);
Function module_stats_init() requires mod_debugfs_root to be initialized, which
is done in module_debugfs_init(). Both functions are registered via
module_init(). Just to make sure, is their ordering guaranteed in some way?
> diff --git a/kernel/module/tracking.c b/kernel/module/tracking.c
> index 26d812e07615..16742d1c630c 100644
> --- a/kernel/module/tracking.c
> +++ b/kernel/module/tracking.c
> @@ -15,6 +15,7 @@
> #include "internal.h"
>
> static LIST_HEAD(unloaded_tainted_modules);
> +extern struct dentry *mod_debugfs_root;
>
> int try_add_tainted_module(struct module *mod)
> {
> @@ -120,12 +121,8 @@ static const struct file_operations unloaded_tainted_modules_fops = {
>
> static int __init unloaded_tainted_modules_init(void)
> {
> - struct dentry *dir;
> -
> - dir = debugfs_create_dir("modules", NULL);
> - debugfs_create_file("unloaded_tainted", 0444, dir, NULL,
> + debugfs_create_file("unloaded_tainted", 0444, mod_debugfs_root, NULL,
> &unloaded_tainted_modules_fops);
mod_debugfs_root is initialized in module_debugfs_init() only if
CONFIG_MODULE_DEBUG is set. However, my reading is that
CONFIG_MODULE_UNLOAD_TAINT_TRACKING is orthogonal to it and doesn't require
CONFIG_MODULE_DEBUG, so it looks like this change breaks the tracking when
CONFIG_MODULE_DEBUG is not set?
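If keeping a single "modules" directory is the goal, one option might be to
create it whenever either feature is enabled, e.g. (untested):

	#if defined(CONFIG_MODULE_DEBUG) || defined(CONFIG_MODULE_UNLOAD_TAINT_TRACKING)
	static int module_debugfs_init(void)
	{
		mod_debugfs_root = debugfs_create_dir("modules", NULL);
		return 0;
	}
	module_init(module_debugfs_init);
	#endif

Or simply drop the #ifdef and create the directory unconditionally under
CONFIG_MODULES.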
> -
> return 0;
> }
> module_init(unloaded_tainted_modules_init);
Cheers,
Petr