Message-ID: <d6e6abc4-84ae-0ddf-eb02-9f0537d4bed1@redhat.com>
Date: Mon, 17 Oct 2022 09:43:28 +0200
From: David Hildenbrand <david@...hat.com>
To: Petr Pavlu <petr.pavlu@...e.com>, mcgrof@...nel.org
Cc: pmladek@...e.com, linux-modules@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 3/4] module: Merge same-name module load requests
On 16.10.22 14:30, Petr Pavlu wrote:
> During a system boot, it can happen that the kernel receives a burst of
> requests to insert the same module but loading it eventually fails
> during its init call. For instance, udev can make a request to insert
> a frequency module for each individual CPU when another frequency module
> is already loaded, which causes the init function of the new module to
> return an error.
>
> The module loader currently serializes all such requests, with the
> barrier in add_unformed_module(). This creates a lot of unnecessary work
> and delays the boot. It can prevent udev from loading drivers for other
> devices and might cause timeouts of services waiting on them, and
> subsequently a failed boot.
>
> The mentioned serialization was introduced as a side-effect of commit
> 6e6de3dee51a. The kernel before that merged some of the same load requests,
> although it was more by accident and relied on specific timing. The
> patch brings this behavior back in a more explicit form.
>
> The logic is improved as follows:
> * A check whether a module load matches an already loaded module is
> moved right after a module name is determined. -EEXIST continues to be
> returned if the module exists and is live, -EBUSY is returned if
> a same-name module is going.
Can you clarify why the EBUSY change is needed? Why not simply return
EEXIST?
If you have thread 0 loading the module and thread 1 unloading the
module concurrently, then it's pretty much unpredictable what the
outcome will be either way, no?
Add a random sleep to thread 1 (such that the module is *not* going yet)
and the result will be EEXIST.
I suggest avoiding an EBUSY change unless there is a real reason to do so.
User space that concurrently loads and unloads the same module is shaky
already, no?
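
Just to illustrate the two timings (purely hypothetical interleavings):

	T0: finit_module("foo")       T1: delete_module("foo")
	    "foo" is already GOING
	    -> -EBUSY (with this patch)

	T0: finit_module("foo")       T1: sleep(); delete_module("foo")
	    "foo" is still LIVE
	    -> -EEXIST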
> * A new reference-counted shared_load_info structure is introduced to
> keep track of duplicate load requests. Two loads are considered
> equivalent if their module name matches. In case a load duplicates
> another running insert, the code waits for its completion and then
> returns -EEXIST or -EBUSY depending on whether it succeeded.
>
> Moving the check for same-name module loads earlier also has a positive
> effect on reducing memory pressure. For instance, David Hildenbrand and
> Lin Liu reported [1] that when KASAN_INLINE is enabled (resulting in
> large module size), with plenty of devices that udev wants to probe and
> with plenty of CPUs that can carry out that probing concurrently, the
> system can actually run out of module vmap space and trigger vmap
> allocation errors. This is fixed by the patch too as it avoids duplicate
> layout_and_allocate() work.
It might be reasonable to add the kernel messages here. Can you also add
the Reported-by?
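For example:

	Reported-by: David Hildenbrand <david@...hat.com>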
>
> [1] https://lore.kernel.org/all/20221013180518.217405-1-david@redhat.com/
>
> Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading")
> Reviewed-by: Petr Mladek <pmladek@...e.com>
> Signed-off-by: Petr Pavlu <petr.pavlu@...e.com>
> ---
> kernel/module/main.c | 217 ++++++++++++++++++++++++++++++-------------
> 1 file changed, 155 insertions(+), 62 deletions(-)
>
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 5288843ca40f..2228c0f725e7 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -66,11 +66,28 @@
> * uses RCU list operations).
> * 2) module_use links,
> * 3) mod_tree.addr_min/mod_tree.addr_max,
> - * 4) list of unloaded_tainted_modules.
> + * 4) list of unloaded_tainted_modules,
> + * 5) list of running_loads.
> */
> DEFINE_MUTEX(module_mutex);
> LIST_HEAD(modules);
>
> +/* Shared information to track duplicate module loads. */
> +struct shared_load_info {
> + char name[MODULE_NAME_LEN];
> + refcount_t refcnt;
> + struct list_head list;
> + struct completion done;
> + int err;
> +};
> +static LIST_HEAD(running_loads);
> +
> +/*
> + * Waiting for a module load when the exact module name is not known, for
> + * example, when resolving symbols from other modules.
> + */
> +static DECLARE_WAIT_QUEUE_HEAD(module_wq);
> +
> /* Work queue for freeing init sections in success case */
> static void do_free_init(struct work_struct *w);
> static DECLARE_WORK(init_free_wq, do_free_init);
> @@ -124,9 +141,6 @@ static void mod_update_bounds(struct module *mod)
> int modules_disabled;
> core_param(nomodule, modules_disabled, bint, 0);
>
> -/* Waiting for a module to finish initializing? */
> -static DECLARE_WAIT_QUEUE_HEAD(module_wq);
> -
> static BLOCKING_NOTIFIER_HEAD(module_notify_list);
>
> int register_module_notifier(struct notifier_block *nb)
> @@ -764,8 +778,6 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
> strscpy(last_unloaded_module.taints, module_flags(mod, buf, false), sizeof(last_unloaded_module.taints));
>
> free_module(mod);
> - /* someone could wait for the module in add_unformed_module() */
> - wake_up_interruptible(&module_wq);
> return 0;
> out:
> mutex_unlock(&module_mutex);
> @@ -2373,26 +2385,6 @@ static int post_relocation(struct module *mod, const struct load_info *info)
> return module_finalize(info->hdr, info->sechdrs, mod);
> }
>
> -/* Is this module of this name done loading? No locks held. */
> -static bool finished_loading(const char *name)
> -{
> - struct module *mod;
> - bool ret;
> -
> - /*
> - * The module_mutex should not be a heavily contended lock;
> - * if we get the occasional sleep here, we'll go an extra iteration
> - * in the wait_event_interruptible(), which is harmless.
> - */
> - sched_annotate_sleep();
> - mutex_lock(&module_mutex);
> - mod = find_module_all(name, strlen(name), true);
> - ret = !mod || mod->state == MODULE_STATE_LIVE;
> - mutex_unlock(&module_mutex);
> -
> - return ret;
> -}
> -
> /* Call module constructors. */
> static void do_mod_ctors(struct module *mod)
> {
> @@ -2523,7 +2515,6 @@ static noinline int do_init_module(struct module *mod)
> schedule_work(&init_free_wq);
>
> mutex_unlock(&module_mutex);
> - wake_up_interruptible(&module_wq);
>
> return 0;
>
> @@ -2539,7 +2530,6 @@ static noinline int do_init_module(struct module *mod)
> klp_module_going(mod);
> ftrace_release_mod(mod);
> free_module(mod);
> - wake_up_interruptible(&module_wq);
> return ret;
> }
>
> @@ -2551,43 +2541,138 @@ static int may_init_module(void)
> return 0;
> }
>
> +static struct shared_load_info *
> +shared_load_info_alloc(const struct load_info *info)
> +{
> + struct shared_load_info *shared_info =
> + kzalloc(sizeof(*shared_info), GFP_KERNEL);
> + if (shared_info == NULL)
> + return ERR_PTR(-ENOMEM);
> +
> + strscpy(shared_info->name, info->name, sizeof(shared_info->name));
> + refcount_set(&shared_info->refcnt, 1);
> + INIT_LIST_HEAD(&shared_info->list);
> + init_completion(&shared_info->done);
> + return shared_info;
> +}
> +
> +static void shared_load_info_get(struct shared_load_info *shared_info)
> +{
> + refcount_inc(&shared_info->refcnt);
> +}
> +
> +static void shared_load_info_put(struct shared_load_info *shared_info)
> +{
> + if (refcount_dec_and_test(&shared_info->refcnt))
> + kfree(shared_info);
> +}
> +
> /*
> - * We try to place it in the list now to make sure it's unique before
> - * we dedicate too many resources. In particular, temporary percpu
> + * Check that a module load is unique and make it visible to others. The code
> + * looks for parallel running inserts and already loaded modules. Two inserts
> + * are considered equivalent if their module name matches. In case this load
> + * duplicates another running insert, the code waits for its completion and
> + * then returns -EEXIST or -EBUSY depending on whether it succeeded.
> + *
> + * Detecting early that a load is unique avoids dedicating too many cycles and
> + * resources to bring up the module. In particular, it prevents temporary percpu
> * memory exhaustion.
> + *
> + * Merging same-name load requests primarily helps during the boot process.
> + * It can happen that the kernel receives a burst of requests to load the
> + * same module (for example, the same module for each individual CPU) and
> + * loading it eventually fails during its init call. Merging the requests
> + * ensures that only one full attempt to load the module is made.
> + *
> + * On a non-error return, it is guaranteed that this load is unique.
> */
> -static int add_unformed_module(struct module *mod)
> +static struct shared_load_info *add_running_load(const struct load_info *info)
> {
> - int err;
> struct module *old;
> + struct shared_load_info *shared_info;
>
> - mod->state = MODULE_STATE_UNFORMED;
> -
> -again:
> mutex_lock(&module_mutex);
> - old = find_module_all(mod->name, strlen(mod->name), true);
> - if (old != NULL) {
> - if (old->state != MODULE_STATE_LIVE) {
> - /* Wait in case it fails to load. */
> +
> + /* Search if there is a running load of a module with the same name. */
> + list_for_each_entry(shared_info, &running_loads, list)
> + if (strcmp(shared_info->name, info->name) == 0) {
> + int err;
> +
> + shared_load_info_get(shared_info);
> mutex_unlock(&module_mutex);
> - err = wait_event_interruptible(module_wq,
> - finished_loading(mod->name));
> - if (err)
> - goto out_unlocked;
> - goto again;
> +
> + err = wait_for_completion_interruptible(
> + &shared_info->done);
> + /*
> + * Return -EBUSY when the parallel load failed for any
> + * reason. This load might end up differently but we are
> + * not going to try.
Why not? Usually "-EAGAIN" signals that user space should retry. But I
hope that we can avoid EBUSY altogether and simply retry here.
I'd suggest shared_load_info_put()+retry.
No need to optimize for corner cases (concurrent load failing so we
don't retry ourselves).
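
Roughly, an untested sketch of that idea, reusing the helpers from this
patch:

	again:
		mutex_lock(&module_mutex);
		list_for_each_entry(shared_info, &running_loads, list)
			if (strcmp(shared_info->name, info->name) == 0) {
				shared_load_info_get(shared_info);
				mutex_unlock(&module_mutex);

				err = wait_for_completion_interruptible(
					&shared_info->done);
				shared_load_info_put(shared_info);
				if (err)
					return ERR_PTR(err);
				/* Re-check the lists instead of -EBUSY. */
				goto again;
			}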
> + */
> + if (!err)
> + err = shared_info->err ? -EBUSY : -EEXIST;
> + shared_load_info_put(shared_info);
> + shared_info = ERR_PTR(err);
> + goto out_unlocked;
> + }
> +
> + /* Search if there is a live module with the given name already. */
> + old = find_module_all(info->name, strlen(info->name), true);
> + if (old != NULL) {
> + if (old->state == MODULE_STATE_LIVE) {
> + shared_info = ERR_PTR(-EEXIST);
> + goto out;
> }
> - err = -EEXIST;
> +
> + /*
> + * Any active load always has its record in running_loads and so
> + * would be found above. This applies independently of whether such
> + * a module is currently in MODULE_STATE_UNFORMED,
> + * MODULE_STATE_COMING, or even in MODULE_STATE_GOING if its
> + * initialization failed. It therefore means this must be an
> + * older going module and the caller should try later once it is
> + * gone.
> + */
> + WARN_ON(old->state != MODULE_STATE_GOING);
> + shared_info = ERR_PTR(-EBUSY);
As raised above, why not EEXIST? Concurrent loading+unloading is racy
either way.
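I.e., simply

	shared_info = ERR_PTR(-EEXIST);

here as well.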
> goto out;
> }
> - mod_update_bounds(mod);
> - list_add_rcu(&mod->list, &modules);
> - mod_tree_insert(mod);
> - err = 0;
> +
> + /* The load is unique, make it visible to others. */
> + shared_info = shared_load_info_alloc(info);
> + if (IS_ERR(shared_info))
> + goto out;
> + list_add(&shared_info->list, &running_loads);
>
> out:
> mutex_unlock(&module_mutex);
> out_unlocked:
> - return err;
> + return shared_info;
> +}
> +
> +static void finalize_running_load(struct shared_load_info *shared_info, int err)
s/finalize/release? It would be nice if the name corresponded to the
opposite action of "add_running_load".
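I.e., something like:

	static void release_running_load(struct shared_load_info *shared_info,
					 int err)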
> +{
> + /* Inform other duplicate inserts that the load finished. */
> + mutex_lock(&module_mutex);
> + list_del(&shared_info->list);
> + shared_info->err = err;
> + mutex_unlock(&module_mutex);
> +
> + complete_all(&shared_info->done);
> + shared_load_info_put(shared_info);
> +
> + /* Tell other modules waiting on this one that it completed loading. */
> + wake_up_interruptible(&module_wq);
> +}
> +
[...]
> sysfs_cleanup:
> mod_sysfs_teardown(mod);
> @@ -2880,15 +2973,15 @@ static int load_module(struct load_info *info, const char __user *uargs,
> /* Unlink carefully: kallsyms could be walking list. */
> list_del_rcu(&mod->list);
> mod_tree_remove(mod);
> - wake_up_interruptible(&module_wq);
> /* Wait for RCU-sched synchronizing before releasing mod->list. */
> synchronize_rcu();
> mutex_unlock(&module_mutex);
> - free_module:
> /* Free lock-classes; relies on the preceding sync_rcu() */
> lockdep_free_key_range(mod->data_layout.base, mod->data_layout.size);
>
> module_deallocate(mod, info);
> + free_shared:
Ideally, the label matches what's actually being done. So maybe
"release_shared" if you go with "release_..."
> + finalize_running_load(shared_info, err);
> free_copy:
> free_copy(info, flags);
> return err;
--
Thanks,
David / dhildenb