linux-kernel - Re: [PATCH 3/5] sched/fair: rework load

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAKfTPtBp+mk6+O3qczwmJ=0Xs71t0jaORqOxj8kgH-CKGZxFUg@mail.gmail.com>
Date:   Wed, 31 Jul 2019 17:37:37 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Quentin Perret <quentin.perret@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Phil Auld <pauld@...hat.com>
Subject: Re: [PATCH 3/5] sched/fair: rework load_balance

On Wed, 31 Jul 2019 at 15:44, Srikar Dronamraju
<srikar@...ux.vnet.ibm.com> wrote:
>
> * Vincent Guittot <vincent.guittot@...aro.org> [2019-07-26 16:42:53]:
>
> > On Fri, 26 Jul 2019 at 15:59, Srikar Dronamraju
> > <srikar@...ux.vnet.ibm.com> wrote:
> > > > @@ -7361,19 +7357,46 @@ static int detach_tasks(struct lb_env *env)
> > > >               if (!can_migrate_task(p, env))
> > > >                       goto next;
> > > >
> > > > -             load = task_h_load(p);
> > > > +             if (env->src_grp_type == migrate_load) {
> > > > +                     unsigned long load = task_h_load(p);
> > > >
> > > > -             if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> > > > -                     goto next;
> > > > +                     if (sched_feat(LB_MIN) &&
> > > > +                         load < 16 && !env->sd->nr_balance_failed)
> > > > +                             goto next;
> > > > +
> > > > +                     if ((load / 2) > env->imbalance)
> > > > +                             goto next;
> > >
> > > I know this existed before too but if the load is exactly or around 2x of
> > > env->imbalance, the resultant imbalance after the load balance operation
> > > would still be around env->imbalance. We may lose some cache affinity too.
> > >
> > > Can we do something like.
> > >                 if (2 * load > 3 * env->imbalance)
> > >                         goto next;
> >
> > TBH, I don't know what should be the best value and it's probably
> > worth doing some investigation but i would prefer to do that as a
> > separate patch to get a similar behavior in the overloaded case
> > Why do you propose 3/2 instead of 2 ?
> >
>
> If the imbalance is exactly or around load/2, then we still select the task to migrate
> However after the migrate the imbalance will still be load/2.
> - Can this lead to ping/pong?

In some case you're probably right but this might also help to move
other tasks when  3/2 will not.

1st example with 3 tasks with almost same load on 2 cpus, we will
probably end up migrating the waiting task with a load  / 2 ~<
env->imbalance whereas 2 * load > 3 * env->imbalance will not move any
task.
Note that we don't ensure fairness between the 3 tasks in the latter
case. TBH I don't know what is better

2nd example with 2 tasks TA and TB  with a load around L and 4 tasks
T0 T1 T2 T3 with a load around L/4
TA and TB are on CPU0 and T0 to T3 are on CPU1, we have the same
imbalance as previous example 2L on CPU0 and L on CPU1.
With load / 2 > env->imbalance, we will migrate TA or TB to CPU1 and
then, we might end up migrating 2 tasks among T0 to T3 and the system
will be balanced
With 2 * load > 3 * env->imbalance, we will not migrate TA or TB
because task load is higher the the imbalance which is L/2 and the
scheduler will never get a chance to balance the system whereas it
could have reached a balanced state

Just to say that I'm not sure that there one best ratio to use

> - Did we lose out of cache though we didn't gain from an imbalance.
>
> > > >
> > > >
> > > > -     if (sgs->sum_h_nr_running)
> > > > -             sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running;
> > > > +     sgs->group_capacity = group->sgc->capacity;
> > > >
> > > >       sgs->group_weight = group->group_weight;
> > > >
> > > > -     sgs->group_no_capacity = group_is_overloaded(env, sgs);
> > > > -     sgs->group_type = group_classify(group, sgs);
> > > > +     sgs->group_type = group_classify(env, group, sgs);
> > > > +
> > > > +     /* Computing avg_load makes sense only when group is overloaded */
> > > > +     if (sgs->group_type != group_overloaded)
> > > > +             sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) /
> > > > +                             sgs->group_capacity;
> > >
> > > Mismatch in comment and code?
> >
> > I may need to add more comments but at this step, the group should be
> > either overloaded or fully busy but it can also be imbalanced.
> > In case of a group fully busy or imbalanced (sgs->group_type !=
> > group_overloaded), we haven't computed avg_load yet so we have to do
> > so because:
> > -In the case of fully_busy, we are going to be overloaded which the
> > next step after fully busy when you are about to pull more load
> > -In case of imbalance, we don't know the real state of the local group
> > so we fall back to this default behavior
> >
>
> We seem to be checking for avg_load when the group_type is group_overloaded.
> But somehow I am don't see where sgs->avg_load is calculated for
> group_overloaded case.

My fault, I read to quickly your comment and thought that you were
referring to calculte_imbalance() but your comment is about
update_sg_lb_stats().
As Peter mentioned previously this should be
  if (sgs->group_type == group_overloaded)
instead of
  if (sgs->group_type != group_overloaded)

>
> > >
> > > We calculated avg_load for !group_overloaded case, but seem to be using for
> > > group_overloaded cases too.
> >
> > for group_overloaded case, we already computed it in update_sg_lb_stats()
> >
>
>
> From update_sg_lb_stats()
>
>         if (sgs->group_type != group_overloaded)
>                 sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) /
>                                 sgs->group_capacity;
>
> So we seem to be skipping calculation of avg_load for group_overloaded. No?

yes.

Thanks
Vincent

>
>
> --
> Thanks and Regards
> Srikar Dronamraju
>