linux-kernel - Re: [PATCH 3/5] sched/fair: rework load

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtD2hDxnHSaa5C_Jrtabb_ogJSgkLE=5UPyystKZqUmzWA@mail.gmail.com>
Date:   Fri, 26 Jul 2019 11:01:05 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Valentin Schneider <valentin.schneider@....com>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Quentin Perret <quentin.perret@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Phil Auld <pauld@...hat.com>
Subject: Re: [PATCH 3/5] sched/fair: rework load_balance

On Thu, 25 Jul 2019 at 19:17, Valentin Schneider
<valentin.schneider@....com> wrote:
>
> Hi Vincent,
>
> first batch of questions/comments here...
>
> On 19/07/2019 08:58, Vincent Guittot wrote:
> [...]
> >  kernel/sched/fair.c | 539 ++++++++++++++++++++++++++++------------------------
> >  1 file changed, 289 insertions(+), 250 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 67f0acd..472959df 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3771,7 +3771,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> >               return;
> >       }
> >
> > -     rq->misfit_task_load = task_h_load(p);
> > +     rq->misfit_task_load = task_util_est(p);
>
> Huh, interesting. Why go for utilization?

Mainly because that's what is used to detect a misfit task and not the load

>
> Right now we store the load of the task and use it to pick the "biggest"
> misfit (in terms of load) when there are more than one misfit tasks to
> choose:

But having a big load doesn't mean that you have a big utilization

so you can trig the misfit case because of task A with a big
utilization that doesn't fit to its local cpu. But then select a task
B in detach_tasks that has a small utilization but a big weight and as
a result a higher load
And task B will never trig the misfit UC by itself and should not
steal the pulling opportunity of task A

>
> update_sd_pick_busiest():
> ,----
> | /*
> |  * If we have more than one misfit sg go with the biggest misfit.
> |  */
> | if (sgs->group_type == group_misfit_task &&
> |     sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> |       return false;
> `----
>
> I don't think it makes much sense to maximize utilization for misfit tasks:
> they're over the capacity margin, which exactly means "I can't really tell
> you much on that utilization other than it doesn't fit".
>
> At the very least, this rq field should be renamed "misfit_task_util".

yes. I agree that i should rename the field

>
> [...]
>
> > @@ -7060,12 +7048,21 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
> >  enum fbq_type { regular, remote, all };
> >
> >  enum group_type {
> > -     group_other = 0,
> > +     group_has_spare = 0,
> > +     group_fully_busy,
> >       group_misfit_task,
> > +     group_asym_capacity,
> >       group_imbalanced,
> >       group_overloaded,
> >  };
> >
> > +enum group_migration {
> > +     migrate_task = 0,
> > +     migrate_util,
> > +     migrate_load,
> > +     migrate_misfit,
>
> Can't we have only 3 imbalance types (task, util, load), and make misfit
> fall in that first one? Arguably it is a special kind of task balance,
> since it would go straight for the active balance, but it would fit a
> `migrate_task` imbalance with a "go straight for active balance" flag
> somewhere.

migrate_misfit uses its own special condition to detect the task that
can be pulled compared to the other ones

>
> [...]
>
> > @@ -7361,19 +7357,46 @@ static int detach_tasks(struct lb_env *env)
> >               if (!can_migrate_task(p, env))
> >                       goto next;
> >
> > -             load = task_h_load(p);
> > +             if (env->src_grp_type == migrate_load) {
> > +                     unsigned long load = task_h_load(p);
> >
> > -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> > -                     goto next;
> > +                     if (sched_feat(LB_MIN) &&
> > +                         load < 16 && !env->sd->nr_balance_failed)
> > +                             goto next;
> > +
> > +                     if ((load / 2) > env->imbalance)
> > +                             goto next;
> > +
> > +                     env->imbalance -= load;
> > +             } else  if (env->src_grp_type == migrate_util) {
> > +                     unsigned long util = task_util_est(p);
> > +
> > +                     if (util > env->imbalance)
> > +                             goto next;
> > +
> > +                     env->imbalance -= util;
> > +             } else if (env->src_grp_type == migrate_misfit) {
> > +                     unsigned long util = task_util_est(p);
> > +
> > +                     /*
> > +                      * utilization of misfit task might decrease a bit
> > +                      * since it has been recorded. Be conservative in the
> > +                      * condition.
> > +                      */
> > +                     if (2*util < env->imbalance)
> > +                             goto next;
>
> Other than I'm not convinced we should track utilization for misfit tasks,
> can the utilization of the task really be halved between the last
> update_misfit_status() and here?

No it's not but this margin is quite simple to compute and should
cover all cases:
It's high enough to filter others non misfit tasks and small enough to
cope with a decrease of the utilization of the misfit task since we
detected the UC

>
> > +
> > +                     env->imbalance = 0;
> > +             } else {
> > +                     /* Migrate task */
> > +                     env->imbalance--;
> > +             }
> >
> > -             if ((load / 2) > env->imbalance)
> > -                     goto next;
> >
> >               detach_task(p, env);
> >               list_add(&p->se.group_node, &env->tasks);
> >
> >               detached++;
> > -             env->imbalance -= load;
> >
> >  #ifdef CONFIG_PREEMPT
> >               /*
>
> [...]
>
> > @@ -8357,72 +8318,115 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >       if (busiest->group_type == group_imbalanced) {
> >               /*
> >                * In the group_imb case we cannot rely on group-wide averages
> > -              * to ensure CPU-load equilibrium, look at wider averages. XXX
> > +              * to ensure CPU-load equilibrium, try to move any task to fix
> > +              * the imbalance. The next load balance will take care of
> > +              * balancing back the system.
> >                */
> > -             busiest->load_per_task =
> > -                     min(busiest->load_per_task, sds->avg_load);
> > +             env->src_grp_type = migrate_task;
> > +             env->imbalance = 1;
> > +             return;
> >       }
> >
> > -     /*
> > -      * Avg load of busiest sg can be less and avg load of local sg can
> > -      * be greater than avg load across all sgs of sd because avg load
> > -      * factors in sg capacity and sgs with smaller group_type are
> > -      * skipped when updating the busiest sg:
> > -      */
> > -     if (busiest->group_type != group_misfit_task &&
> > -         (busiest->avg_load <= sds->avg_load ||
> > -          local->avg_load >= sds->avg_load)) {
> > -             env->imbalance = 0;
> > -             return fix_small_imbalance(env, sds);
> > +     if (busiest->group_type == group_misfit_task) {
> > +             /* Set imbalance to allow misfit task to be balanced. */
> > +             env->src_grp_type = migrate_misfit;
> > +             env->imbalance = busiest->group_misfit_task_load;
> > +             return;
> >       }
> >
> >       /*
> > -      * If there aren't any idle CPUs, avoid creating some.
> > +      * Try to use spare capacity of local group without overloading it or
> > +      * emptying busiest
> >        */
> > -     if (busiest->group_type == group_overloaded &&
> > -         local->group_type   == group_overloaded) {
> > -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> > -             if (load_above_capacity > busiest->group_capacity) {
> > -                     load_above_capacity -= busiest->group_capacity;
> > -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);
> > -                     load_above_capacity /= busiest->group_capacity;
> > -             } else
> > -                     load_above_capacity = ~0UL;
> > +     if (local->group_type == group_has_spare) {
> > +             long imbalance;
> > +
> > +             /*
> > +              * If there is no overload, we just want to even the number of
> > +              * idle cpus.
> > +              */
> > +             env->src_grp_type = migrate_task;
> > +             imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
> > +
> > +             if (sds->prefer_sibling)
> > +                     /*
> > +                      * When prefer sibling, evenly spread running tasks on
> > +                      * groups.
> > +                      */
> > +                     imbalance = (busiest->sum_nr_running - local->sum_nr_running) >> 1;
> > +
> > +             if (busiest->group_type > group_fully_busy) {
> > +                     /*
> > +                      * If busiest is overloaded, try to fill spare
> > +                      * capacity. This might end up creating spare capacity
> > +                      * in busiest or busiest still being overloaded but
> > +                      * there is no simple way to directly compute the
> > +                      * amount of load to migrate in order to balance the
> > +                      * system.
> > +                      */
> > +                     env->src_grp_type = migrate_util;
> > +                     imbalance = max(local->group_capacity, local->group_util) -
> > +                                 local->group_util;
>
> Rather than filling the local group, shouldn't we follow the same strategy
> as for load, IOW try to reach an average without pushing local above nor
> busiest below ?

But we don't know if this will be enough to make the busiest group not
overloaded anymore

This is a transient state:
a group is overloaded, another one has spare capacity
How to balance the system will depend of how much overload if in the
group and we don't know this value.
The only solution is to:
- try to pull as much task as possible to fill the spare capacity
- Is the group still overloaded ? use avg_load to balance the system
because both group will be overloaded
- Is the group no more overloaded ? balance the number of idle cpus

>
> We could build an sds->avg_util similar to sds->avg_load.

When there is spare capacity, we balances the number of idle cpus

>
> > +             }
> > +
> > +             env->imbalance = imbalance;
> > +             return;
> >       }
> [...]
> > +/*
> > + * Decision matrix according to the local and busiest group state
> > + *
> > + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded
> > + * has_spare        nr_idle   balanced   N/A    N/A  balanced   balanced
> > + * fully_busy       nr_idle   nr_idle    N/A    N/A  balanced   balanced
> > + * misfit_task      force     N/A        N/A    N/A  force      force
> > + * asym_capacity    force     force      N/A    N/A  force      force
> > + * imbalanced       force     force      N/A    N/A  force      force
> > + * overloaded       force     force      N/A    N/A  force      avg_load
> > + *
> > + * N/A :      Not Applicable because already filtered while updating
> > + *            statistics.
> > + * balanced : The system is balanced for these 2 groups.
> > + * force :    Calculate the imbalance as load migration is probably needed.
> > + * avg_load : Only if imbalance is significant enough.
> > + * nr_idle :  dst_cpu is not busy and the number of idle cpus is quite
> > + *            different in groups.
> > + */
> > +
>
> Nice!