[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtA7UKL4NJHkMTx=MgohbXqO6kJCwamEjBX-zu3nNO1XLA@mail.gmail.com>
Date: Fri, 26 Jul 2019 16:42:53 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Cc: linux-kernel <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Quentin Perret <quentin.perret@....com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Morten Rasmussen <Morten.Rasmussen@....com>,
Phil Auld <pauld@...hat.com>
Subject: Re: [PATCH 3/5] sched/fair: rework load_balance
On Fri, 26 Jul 2019 at 15:59, Srikar Dronamraju
<srikar@...ux.vnet.ibm.com> wrote:
>
> >
> > The type of sched_group has been extended to better reflect the type of
> > imbalance. We now have :
> > group_has_spare
> > group_fully_busy
> > group_misfit_task
> > group_asym_capacity
> > group_imbalanced
> > group_overloaded
>
> How is group_fully_busy different from group_overloaded?
group_fully_busy means that tasks have enough compute capacity whereas
group_overloaded means that tasks are competing to use the CPU and
need more compute capacity.
As an example:
a cpu is fully busy with 1 always running task
but a cpu is overloaded with 2 always running tasks or 2 task that
want 75% of the CPU as an example
>
> >
> > Based on the type fo sched_group, load_balance now sets what it wants to
> > move in order to fix the imnbalance. It can be some load as before but
> > also some utilization, a number of task or a type of task:
> > migrate_task
> > migrate_util
> > migrate_load
> > migrate_misfit
>
> Can we club migrate_util and migrate_misfit?
migrate_misfit want to move 1 task whereas migrate_util want to
migrate an amount of utilization which can lead to migrate several
tasks
>
> > @@ -7361,19 +7357,46 @@ static int detach_tasks(struct lb_env *env)
> > if (!can_migrate_task(p, env))
> > goto next;
> >
> > - load = task_h_load(p);
> > + if (env->src_grp_type == migrate_load) {
> > + unsigned long load = task_h_load(p);
> >
> > - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> > - goto next;
> > + if (sched_feat(LB_MIN) &&
> > + load < 16 && !env->sd->nr_balance_failed)
> > + goto next;
> > +
> > + if ((load / 2) > env->imbalance)
> > + goto next;
>
> I know this existed before too but if the load is exactly or around 2x of
> env->imbalance, the resultant imbalance after the load balance operation
> would still be around env->imbalance. We may lose some cache affinity too.
>
> Can we do something like.
> if (2 * load > 3 * env->imbalance)
> goto next;
TBH, I don't know what should be the best value and it's probably
worth doing some investigation but i would prefer to do that as a
separate patch to get a similar behavior in the overloaded case
Why do you propose 3/2 instead of 2 ?
>
> > @@ -7690,14 +7711,14 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
> > *sds = (struct sd_lb_stats){
> > .busiest = NULL,
> > .local = NULL,
> > - .total_running = 0UL,
> > .total_load = 0UL,
> > .total_capacity = 0UL,
> > .busiest_stat = {
> > .avg_load = 0UL,
> > .sum_nr_running = 0,
> > .sum_h_nr_running = 0,
> > - .group_type = group_other,
> > + .idle_cpus = UINT_MAX,
>
> Why are we initializing idle_cpus to UINT_MAX? Shouldnt this be set to 0?
This is the default busiest statistics attached to env
> I only see an increment and compare.
In update_sd_pick_busiest(), we look for the group_has_spare_capacity
with lowest number of idle cpus which we expect to be the busiest.
So the 1st group with spare capacity will have for sure less idle_cpus
and will replace the default one
>
> > + .group_type = group_has_spare,
> > },
> > };
> > }
> >
> > static inline enum
> > -group_type group_classify(struct sched_group *group,
> > +group_type group_classify(struct lb_env *env,
> > + struct sched_group *group,
> > struct sg_lb_stats *sgs)
> > {
> > - if (sgs->group_no_capacity)
> > + if (group_is_overloaded(env, sgs))
> > return group_overloaded;
> >
> > if (sg_imbalanced(group))
> > @@ -7953,7 +7975,13 @@ group_type group_classify(struct sched_group *group,
> > if (sgs->group_misfit_task_load)
> > return group_misfit_task;
> >
> > - return group_other;
> > + if (sgs->group_asym_capacity)
> > + return group_asym_capacity;
> > +
> > + if (group_has_capacity(env, sgs))
> > + return group_has_spare;
> > +
> > + return group_fully_busy;
>
> If its not overloaded but also doesnt have capacity.
> Does it mean its capacity is completely filled.
> Cant we consider that as same as overloaded?
I have answered to this in the 1st question
>
> > }
> >
> >
> > - if (sgs->sum_h_nr_running)
> > - sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running;
> > + sgs->group_capacity = group->sgc->capacity;
> >
> > sgs->group_weight = group->group_weight;
> >
> > - sgs->group_no_capacity = group_is_overloaded(env, sgs);
> > - sgs->group_type = group_classify(group, sgs);
> > + sgs->group_type = group_classify(env, group, sgs);
> > +
> > + /* Computing avg_load makes sense only when group is overloaded */
> > + if (sgs->group_type != group_overloaded)
> > + sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) /
> > + sgs->group_capacity;
>
> Mismatch in comment and code?
I may need to add more comments but at this step, the group should be
either overloaded or fully busy but it can also be imbalanced.
In case of a group fully busy or imbalanced (sgs->group_type !=
group_overloaded), we haven't computed avg_load yet so we have to do
so because:
-In the case of fully_busy, we are going to be overloaded which the
next step after fully busy when you are about to pull more load
-In case of imbalance, we don't know the real state of the local group
so we fall back to this default behavior
>
> > @@ -8079,11 +8120,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > if (sgs->group_type < busiest->group_type)
> > return false;
> >
> > - if (sgs->avg_load <= busiest->avg_load)
> > + /* Select the overloaded group with highest avg_load */
> > + if (sgs->group_type == group_overloaded &&
> > + sgs->avg_load <= busiest->avg_load)
> > + return false;
> > +
> > + /* Prefer to move from lowest priority CPU's work */
> > + if (sgs->group_type == group_asym_capacity && sds->busiest &&
> > + sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> > return false;
>
> I thought this should have been
> /* Prefer to move from lowest priority CPU's work */
> if (sgs->group_type == group_asym_capacity && sds->busiest &&
> sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> return true;
Here we want to select the "busiest" group_asym_capacity which means
the one with the lowest priority
If sg->asym_prefer_cpu is prefered to be used instead of
sds->busiest->asym_prefer_cpu, we should keep busiest as the group to
be emptied and return false to not replace the latter
>
> > @@ -8357,72 +8318,115 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > if (busiest->group_type == group_imbalanced) {
> > /*
> > * In the group_imb case we cannot rely on group-wide averages
> > - * to ensure CPU-load equilibrium, look at wider averages. XXX
> > + * to ensure CPU-load equilibrium, try to move any task to fix
> > + * the imbalance. The next load balance will take care of
> > + * balancing back the system.
> > */
> > - busiest->load_per_task =
> > - min(busiest->load_per_task, sds->avg_load);
> > + env->src_grp_type = migrate_task;
> > + env->imbalance = 1;
> > + return;
> > }
> >
> > - /*
> > - * Avg load of busiest sg can be less and avg load of local sg can
> > - * be greater than avg load across all sgs of sd because avg load
> > - * factors in sg capacity and sgs with smaller group_type are
> > - * skipped when updating the busiest sg:
> > - */
> > - if (busiest->group_type != group_misfit_task &&
> > - (busiest->avg_load <= sds->avg_load ||
> > - local->avg_load >= sds->avg_load)) {
> > - env->imbalance = 0;
> > - return fix_small_imbalance(env, sds);
> > + if (busiest->group_type == group_misfit_task) {
> > + /* Set imbalance to allow misfit task to be balanced. */
> > + env->src_grp_type = migrate_misfit;
> > + env->imbalance = busiest->group_misfit_task_load;
> > + return;
> > }
> >
> > /*
> > - * If there aren't any idle CPUs, avoid creating some.
> > + * Try to use spare capacity of local group without overloading it or
> > + * emptying busiest
> > */
> > - if (busiest->group_type == group_overloaded &&
> > - local->group_type == group_overloaded) {
> > - load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> > - if (load_above_capacity > busiest->group_capacity) {
> > - load_above_capacity -= busiest->group_capacity;
> > - load_above_capacity *= scale_load_down(NICE_0_LOAD);
> > - load_above_capacity /= busiest->group_capacity;
> > - } else
> > - load_above_capacity = ~0UL;
> > + if (local->group_type == group_has_spare) {
> > + long imbalance;
> > +
> > + /*
> > + * If there is no overload, we just want to even the number of
> > + * idle cpus.
> > + */
> > + env->src_grp_type = migrate_task;
> > + imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
>
> Shouldnt this be?
> imbalance = max_t(long, 0, (busiest->idle_cpus - local->idle_cpus) >> 1);
local has more idle_cpus than busiest otherwise we don't try to pull task
> > +
> > + if (sds->prefer_sibling)
> > + /*
> > + * When prefer sibling, evenly spread running tasks on
> > + * groups.
> > + */
> > + imbalance = (busiest->sum_nr_running - local->sum_nr_running) >> 1;
> > +
> > + if (busiest->group_type > group_fully_busy) {
> > + /*
> > + * If busiest is overloaded, try to fill spare
> > + * capacity. This might end up creating spare capacity
> > + * in busiest or busiest still being overloaded but
> > + * there is no simple way to directly compute the
> > + * amount of load to migrate in order to balance the
> > + * system.
> > + */
> > + env->src_grp_type = migrate_util;
> > + imbalance = max(local->group_capacity, local->group_util) -
> > + local->group_util;
> > + }
> > +
> > + env->imbalance = imbalance;
> > + return;
> > }
> >
> > /*
> > - * We're trying to get all the CPUs to the average_load, so we don't
> > - * want to push ourselves above the average load, nor do we wish to
> > - * reduce the max loaded CPU below the average load. At the same time,
> > - * we also don't want to reduce the group load below the group
> > - * capacity. Thus we look for the minimum possible imbalance.
> > + * Local is fully busy but have to take more load to relieve the
> > + * busiest group
> > */
> > - max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
> > + if (local->group_type < group_overloaded) {
>
>
> What action are we taking if we find the local->group_type to be group_imbalanced
> or group_misfit ? We will fall here but I dont know if it right to look for
> avg_load in that case.
local->group_type can't be misfit
For local->group_type is imbalance , I answered in a previous comment
>
> > + /*
> > + * Local will become overvloaded so the avg_load metrics are
> > + * finally needed
> > + */
> >
> > - /* How much load to actually move to equalise the imbalance */
> > - env->imbalance = min(
> > - max_pull * busiest->group_capacity,
> > - (sds->avg_load - local->avg_load) * local->group_capacity
> > - ) / SCHED_CAPACITY_SCALE;
> > + local->avg_load = (local->group_load*SCHED_CAPACITY_SCALE)
> > + / local->group_capacity;
> >
> > - /* Boost imbalance to allow misfit task to be balanced. */
> > - if (busiest->group_type == group_misfit_task) {
> > - env->imbalance = max_t(long, env->imbalance,
> > - busiest->group_misfit_task_load);
> > + sds->avg_load = (SCHED_CAPACITY_SCALE * sds->total_load)
> > + / sds->total_capacity;
> > }
> >
> > /*
> > - * if *imbalance is less than the average load per runnable task
> > - * there is no guarantee that any tasks will be moved so we'll have
> > - * a think about bumping its value to force at least one task to be
> > - * moved
> > + * Both group are or will become overloaded and we're trying to get
> > + * all the CPUs to the average_load, so we don't want to push
> > + * ourselves above the average load, nor do we wish to reduce the
> > + * max loaded CPU below the average load. At the same time, we also
> > + * don't want to reduce the group load below the group capacity.
> > + * Thus we look for the minimum possible imbalance.
> > */
> > - if (env->imbalance < busiest->load_per_task)
> > - return fix_small_imbalance(env, sds);
> > + env->src_grp_type = migrate_load;
> > + env->imbalance = min(
> > + (busiest->avg_load - sds->avg_load) * busiest->group_capacity,
> > + (sds->avg_load - local->avg_load) * local->group_capacity
> > + ) / SCHED_CAPACITY_SCALE;
> > }
>
> We calculated avg_load for !group_overloaded case, but seem to be using for
> group_overloaded cases too.
for group_overloaded case, we already computed it in update_sg_lb_stats()
>
>
> --
> Thanks and Regards
> Srikar Dronamraju
>
Powered by blists - more mailing lists