linux-kernel - Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAKfTPtAxnsEDL436zUypLj2XyMQyhgPvJ8q_23835sQxWzGtxw@mail.gmail.com>
Date:   Mon, 25 Jan 2021 11:56:28 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Aubrey Li <aubrey.li@...el.com>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Andi Kleen <ak@...ux.intel.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        "Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
        Aubrey Li <aubrey.li@...ux.intel.com>
Subject: Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the
 same sched_domain level

On Mon, 25 Jan 2021 at 06:50, Aubrey Li <aubrey.li@...el.com> wrote:
>
> A long-tail load balance cost is observed on the newly idle path,
> this is caused by a race window between the first nr_running check
> of the busiest runqueue and its nr_running recheck in detach_tasks.
>
> Before the busiest runqueue is locked, the tasks on the busiest
> runqueue could be pulled by other CPUs and nr_running of the busiest
> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED

We should better detect that when trying to detach task like below

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)

        lockdep_assert_held(&env->src_rq->lock);

+       /*
+        * Another CPU has emptied this runqueue in the meantime.
+        * Just return and leave the load_balance properly.
+        */
+       if (env->src_rq->nr_running <= 1 && !env->loop) {
+               /* Clear the flag as we will not test any task */
+               env->flags &= ~LBF_ALL_PINNED;
+               return 0;
+       }
+
        if (env->imbalance <= 0)
                return 0;


> flag set, and triggers load_balance redo at the same sched_domain level.
>
> In order to find the new busiest sched_group and CPU, load balance will
> recompute and update the various load statistics, which eventually leads
> to the long-tail load balance cost.
>
> This patch introduces a variable(sched_nr_lb_redo) to limit load balance
> redo times, combined with sysctl_sched_nr_migrate, the max load balance
> cost is reduced from 100+ us to 70+ us, measured on a 4s x86 system with
> 192 logical CPUs.
>
> Cc: Andi Kleen <ak@...ux.intel.com>
> Cc: Tim Chen <tim.c.chen@...ux.intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> Signed-off-by: Aubrey Li <aubrey.li@...ux.intel.com>
> ---
>  kernel/sched/fair.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ae7ceba..b59f371 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7407,6 +7407,8 @@ struct lb_env {
>         unsigned int            loop;
>         unsigned int            loop_break;
>         unsigned int            loop_max;
> +       unsigned int            redo_cnt;
> +       unsigned int            redo_max;
>
>         enum fbq_type           fbq_type;
>         enum migration_type     migration_type;
> @@ -9525,6 +9527,7 @@ static int should_we_balance(struct lb_env *env)
>         return group_balance_cpu(sg) == env->dst_cpu;
>  }
>
> +static const unsigned int sched_nr_lb_redo = 1;
>  /*
>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>   * tasks if there is an imbalance.
> @@ -9547,6 +9550,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                 .dst_grpmask    = sched_group_span(sd->groups),
>                 .idle           = idle,
>                 .loop_break     = sched_nr_migrate_break,
> +               .redo_max       = sched_nr_lb_redo,
>                 .cpus           = cpus,
>                 .fbq_type       = all,
>                 .tasks          = LIST_HEAD_INIT(env.tasks),
> @@ -9682,7 +9686,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                          * destination group that is receiving any migrated
>                          * load.
>                          */
> -                       if (!cpumask_subset(cpus, env.dst_grpmask)) {
> +                       if (!cpumask_subset(cpus, env.dst_grpmask) &&
> +                                       ++env.redo_cnt < env.redo_max) {
>                                 env.loop = 0;
>                                 env.loop_break = sched_nr_migrate_break;
>                                 goto redo;
> --
> 2.7.4
>