linux-kernel - Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtDiNKpVWM4Tw2z+z6g+G1nf5SK5wbdsdnyAhAK5=q+OBg@mail.gmail.com>
Date:   Fri, 19 Mar 2021 16:19:36 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Valentin Schneider <valentin.schneider@....com>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        Qais Yousef <qais.yousef@....com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Quentin Perret <qperret@...gle.com>,
        Pavan Kondeti <pkondeti@...eaurora.org>,
        Rik van Riel <riel@...riel.com>,
        Lingutla Chandrasekhar <clingutla@...eaurora.org>
Subject: Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances

On Mon, 15 Mar 2021 at 20:18, Valentin Schneider
<valentin.schneider@....com> wrote:
>
> On 15/03/21 16:13, Vincent Guittot wrote:
> > On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
> > <valentin.schneider@....com> wrote:
> >>
> >> Consider the following (hypothetical) asymmetric CPU capacity topology,
> >> with some amount of capacity pressure (RT | DL | IRQ | thermal):
> >>
> >>   DIE [          ]
> >>   MC  [    ][    ]
> >>        0  1  2  3
> >>
> >>   | CPU | capacity_orig | capacity |
> >>   |-----+---------------+----------|
> >>   |   0 |           870 |      860 |
> >>   |   1 |           870 |      600 |
> >>   |   2 |          1024 |      850 |
> >>   |   3 |          1024 |      860 |
> >>
> >> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> >> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
> >> sufficiently busy, i.e. don't have enough spare capacity to accommodate
> >> CPU1's misfit task. This would then fall on CPU2 to pull the task.
> >>
> >> This currently won't happen, because CPU2 will fail
> >>
> >>   capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
> >
> > which has been introduced by the previous patch: patch5
> >
> >>
> >> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> >> level. In this case, the max_capacity is that of CPU0's, which is at this
> >> point in time greater than that of CPU2's. This comparison doesn't make
> >> much sense, given that the only CPUs we should care about in this scenario
> >> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> >> destination CPU).
> >>
> >> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> >> env->dst_cpu would grant it a capacity uplift. Separately track whether a
> >> sched_group contains a misfit task to still classify it as
> >> group_misfit_task and not pick it as busiest group when pulling from a
> >
> > Could you give more details about why we should keep tracking the
> > group as misfit ? Do you have a UC in mind ?
> >
>
> As stated the current behaviour is to classify groups as group_misfit_task
> regardless of the dst_cpu's capacity. When we see a group_misfit_task
> candidate group misfit task with higher per-CPU capacity than the local
> group, we don't pick it as busiest.
>
> I initially thought not marking those as group_misfit_task was the right
> thing to do, as they could then be classified as group_fully_busy or
> group_has_spare. Consider:
>
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
>        L  L  B  B
>
>   arch_scale_capacity(L) < arch_scale_capacity(B)
>
>   CPUs 0-1 are idle / lightly loaded
>   CPU2 has a misfit task and a few very small tasks
>   CPU3 has a few very small tasks
>
> When CPU0 is running load_balance() at DIE level, right now we'll classify
> the [2-3] group as group_misfit_task and not pick it as busiest because the
> local group has a lower CPU capacity.
>
> If we didn't do that, we could leave the misfit task alone and pull some
> small task(s) from CPU2 or CPU3, which would be a good thing to

Are you sure? the last check in update_sd_pick_busiest() should
already filter this. So it should be enough to let it be classify
correctly

A group should be classified as group_misfit_task when there is a task
to migrate in priority compared to some other groups. In your case,
you tag it as group_misfit_task but in order to do the opposite, i.e.
make sure to not select it. As mentioned above, this will be filter in
the last check in update_sd_pick_busiest()

> do. However, by allowing a group containing a misfit task to be picked as
> the busiest group when a CPU of lower capacity is pulling, we run the risk
> of the misfit task itself being downmigrated - e.g. if we repeatedly
> increment the sd->nr_balance_failed counter and do an active balance (maybe
> because the small tasks were unfortunately cache_hot()).
>
> It's less than ideal, but I considered not downmigrating misfit tasks was
> the thing to prioritize (and FWIW it also maintains current behaviour).
>
>
> Another approach would be to add task utilization vs CPU capacity checks in
> detach_tasks() and need_active_balance() to prevent downmigration when
> env->imbalance_type < group_misfit_task. This may go against the busiest
> group selection heuristics however (misfit tasks could be the main
> contributors to the imbalance, but we end up not moving them).
>
>
> >> lower-capacity CPU (which is the current behaviour and prevents
> >> down-migration).
> >>
> >> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
> >> than the local CPU's, add a capacity check there.
> >>
> >> Reviewed-by: Qais Yousef <qais.yousef@....com>
> >> Signed-off-by: Valentin Schneider <valentin.schneider@....com>
>
> >> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>                         continue;
> >>
> >>                 /* Check for a misfit task on the cpu */
> >> -               if (sd_has_asym_cpucapacity(env->sd) &&
> >> -                   sgs->group_misfit_task_load < rq->misfit_task_load) {
> >> -                       sgs->group_misfit_task_load = rq->misfit_task_load;
> >> -                       *sg_status |= SG_OVERLOAD;
> >> +               if (!sd_has_asym_cpucapacity(env->sd) ||
> >> +                   !rq->misfit_task_load)
> >> +                       continue;
> >> +
> >> +               *sg_status |= SG_OVERLOAD;
> >> +               sgs->group_has_misfit_task = true;
> >> +
> >> +               /*
> >> +                * Don't attempt to maximize load for misfit tasks that can't be
> >> +                * granted a CPU capacity uplift.
> >> +                */
> >> +               if (cpu_capacity_greater(env->dst_cpu, i)) {
> >> +                       sgs->group_misfit_task_load = max(
> >> +                               sgs->group_misfit_task_load,
> >> +                               rq->misfit_task_load);
> >
> > Please encapsulate all this misfit specific code in a dedicated
> > function which will be called from update_sg_lb_stats
> >
>
> Will do.