Date: Thu, 20 Jun 2024 15:45:23 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Josh Don <joshdon@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Valentin Schneider <vschneid@...hat.com>, linux-kernel@...r.kernel.org, 
	Zhang Qiao <zhangqiao22@...wei.com>
Subject: Re: [PATCH] sched/fair: prevent unbounded task iteration in load balance

Hi Josh,

Sorry for the delay.

On Thu, 9 May 2024 at 00:35, Josh Don <joshdon@...gle.com> wrote:
>
> b0defa7ae03ec changed the load balancing logic to ignore env.loop_max if
> all tasks examined to that point were pinned. The goal of the patch was
> to make it more likely to be able to detach a task buried in a long list
> of pinned tasks. However, this has the unfortunate side effect of
> creating an O(n) iteration in detach_tasks(), as we now must fully
> iterate every task on a cpu if all or most are pinned. Since this load
> balance code is done with rq lock held, and often in softirq context, it
> is very easy to trigger hard lockups. We observed such hard lockups with
> a user who affined O(10k) threads to a single cpu.
>
> My initial suggestion to Vincent was to revert the original patch and
> instead bump sched_nr_migrate. However, he pointed out that we still
> want a limit on the number of tasks we actually detach (which is an
> expensive operation), while allowing a deeper search.
>
> As a result of the above, this patch now separates the number of tasks
> we migrate from the number of tasks we can search. Now, the search limit
> can be raised while keeping the nr_migrate fixed.

Thanks for your patch.

I went back through the archives to remind myself how I came up with
this patch. I thought it was fixing a problem reported by
zhangqiao22@...wei.com, but my memory was wrong and it isn't. Instead,
it fixes a theoretical case described in this email thread:
https://lore.kernel.org/lkml/20220818083133.GA536@vingu-book/

So I wonder whether it's worth adding a new debugfs entry for a case
that nobody may ever have hit, or at least never complained about. I
know I asked you to go in this direction when you were originally
proposing a revert, but after more time to think about it, I'm no
longer convinced that a new debugfs interface is worth it.

I have put some comments below, but maybe it would be better to simply
revert b0defa7ae03ec, since nobody has complained about this problem so
far. Let's see what others say, or whether someone has actually hit the
problem, but in the end a revert may be the better option.
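
To be clear, the revert would just drop the LBF_ALL_PINNED exception
again, i.e. go back to something like this in detach_tasks() (from
memory, untested):

		if (env->loop > env->loop_max)
			break;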

>
> Signed-off-by: Josh Don <joshdon@...gle.com>
> ---
>  kernel/sched/core.c  |  3 ++-
>  kernel/sched/debug.c |  1 +
>  kernel/sched/fair.c  | 19 ++++++++++++++-----
>  kernel/sched/sched.h |  1 +
>  4 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index db4be4921e7f..820ffa9bbcfe 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -144,10 +144,11 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
>  #endif /* CONFIG_SCHED_DEBUG */
>
>  /*
> - * Number of tasks to iterate in a single balance run.
> + * Max number of tasks we can move/iterate in a single balance run.
>   * Limited because this is done with IRQs disabled.
>   */
>  const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
> +const_debug unsigned int sysctl_sched_migrate_search_depth = SCHED_NR_MIGRATE_BREAK;
>
>  __read_mostly int scheduler_running;
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 168eecc209b4..d1701c98f996 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -356,6 +356,7 @@ static __init int sched_init_debug(void)
>         debugfs_create_file("tunable_scaling", 0644, debugfs_sched, NULL, &sched_scaling_fops);
>         debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
>         debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);
> +       debugfs_create_u32("migrate_search_depth", 0644, debugfs_sched, &sysctl_sched_migrate_search_depth);
>
>         mutex_lock(&sched_domains_mutex);
>         update_sched_domain_debugfs();
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 34fe6e9490c2..b4b26f39dd45 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8764,6 +8764,9 @@ struct lb_env {
>         unsigned int            loop_break;
>         unsigned int            loop_max;
>
> +       unsigned int            num_detached;
> +       unsigned int            detach_max;
> +
>         enum fbq_type           fbq_type;
>         enum migration_type     migration_type;
>         struct list_head        tasks;
> @@ -9044,11 +9047,11 @@ static int detach_tasks(struct lb_env *env)
>
>                 env->loop++;
>                 /*
> -                * We've more or less seen every task there is, call it quits
> -                * unless we haven't found any movable task yet.
> +                * Call it quits if we're going to exceed our search limit;
> +                * we can't search indefinitely even if we've not found a
> +                * migratable task yet.
>                  */
> -               if (env->loop > env->loop_max &&
> -                   !(env->flags & LBF_ALL_PINNED))
> +               if (env->loop > env->loop_max)
>                         break;
>
>                 /* take a breather every nr_migrate tasks */
> @@ -9116,6 +9119,9 @@ static int detach_tasks(struct lb_env *env)
>                 list_add(&p->se.group_node, &env->tasks);
>
>                 detached++;
> +               env->num_detached++;

Do we really need the env->num_detached field? Can't we simply remove
it and instead set env.detach_max = sysctl_sched_nr_migrate - ld_moved
(see below)?
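
That is, keep using the local 'detached' counter here, something like
(untested):

		detached++;
		if (detached >= env->detach_max)
			break;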

> +               if (env->num_detached >= env->detach_max)
> +                       break;
>
>  #ifdef CONFIG_PREEMPTION
>                 /*
> @@ -11287,6 +11293,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>         env.src_cpu = busiest->cpu;
>         env.src_rq = busiest;
>
> +       env.num_detached = 0;
> +       env.detach_max = sysctl_sched_nr_migrate;
> +
>         ld_moved = 0;
>         /* Clear this flag as soon as we find a pullable task */
>         env.flags |= LBF_ALL_PINNED;
> @@ -11297,7 +11306,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                  * still unbalanced. ld_moved simply stays zero, so it is
>                  * correctly treated as an imbalance.
>                  */
> -               env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
> +               env.loop_max  = min(sysctl_sched_migrate_search_depth, busiest->nr_running);
>
>  more_balance:

 +               env.detach_max = sysctl_sched_nr_migrate - ld_moved;

>                 rq_lock_irqsave(busiest, &rf);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e58a54bda77d..1597175d5f0b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2539,6 +2539,7 @@ extern void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags);
>
>  extern const_debug unsigned int sysctl_sched_nr_migrate;
>  extern const_debug unsigned int sysctl_sched_migration_cost;
> +extern const_debug unsigned int sysctl_sched_migrate_search_depth;
>
>  extern unsigned int sysctl_sched_base_slice;
>
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>
