linux-kernel - Re: [PATCH v2 4/8] sched/fair: rework load

Message-ID: <CAKfTPtAnHSZ9Lb+JUktA8Z_90V9egzU=M5ErrE=PUGy8qUWLBQ@mail.gmail.com>
Date:   Thu, 29 Aug 2019 16:26:49 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Valentin Schneider <valentin.schneider@....com>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Phil Auld <pauld@...hat.com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        Quentin Perret <quentin.perret@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>
Subject: Re: [PATCH v2 4/8] sched/fair: rework load_balance

On Wed, 28 Aug 2019 at 16:19, Valentin Schneider
<valentin.schneider@....com> wrote:
>
> On 26/08/2019 11:11, Vincent Guittot wrote:
> >>> +     case group_fully_busy:
> >>> +             /*
> >>> +              * Select the fully busy group with highest avg_load.
> >>> +              * In theory, there is no need to pull task from such
> >>> +              * kind of group because tasks have all compute
> >>> +              * capacity that they need but we can still improve the
> >>> +              * overall throughput by reducing contention when
> >>> +              * accessing shared HW resources.
> >>> +              * XXX for now avg_load is not computed and always 0 so
> >>> +              * we select the 1st one.
> >>> +              */
> >>
> >> What's wrong with unconditionally computing avg_load in update_sg_lb_stats()?
> >
> > removing useless division which can be expensive
> >
>
> Seeing how much stuff we already do in just computing the stats, do we
> really save that much by doing this? I'd expect it to be negligible with
> modern architectures and all of the OoO/voodoo, but maybe I need a
> refresher course.

We are not running only on the latest, high-end architectures.
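
To illustrate the distinction between accumulating group_load in the loop and dividing only when needed, here is a minimal user-space sketch. It is not the code from the patch: struct group_stats, accumulate(), compute_avg_load() and CAPACITY_SCALE are hypothetical stand-ins, and the formula only mirrors the general shape of the avg_load computation (load scaled by capacity).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's capacity scale. */
#define CAPACITY_SCALE 1024UL

/* Simplified per-group statistics; not the kernel's struct sg_lb_stats. */
struct group_stats {
	unsigned long group_load;     /* summed while walking the group's CPUs */
	unsigned long group_capacity; /* summed while walking the group's CPUs */
	unsigned long avg_load;       /* stays 0 until computed on demand */
};

/* Accumulation has to be done inside the per-CPU loop. */
void accumulate(struct group_stats *gs, const unsigned long *cpu_load,
		const unsigned long *cpu_capacity, size_t nr_cpus)
{
	gs->group_load = 0;
	gs->group_capacity = 0;
	gs->avg_load = 0;

	for (size_t i = 0; i < nr_cpus; i++) {
		gs->group_load += cpu_load[i];
		gs->group_capacity += cpu_capacity[i];
	}
}

/*
 * The division is deferred: call this only for group types that
 * actually compare avg_load, so other cases never pay for it.
 */
unsigned long compute_avg_load(struct group_stats *gs)
{
	assert(gs->group_capacity > 0);
	gs->avg_load = gs->group_load * CAPACITY_SCALE / gs->group_capacity;
	return gs->avg_load;
}
```

A caller would run accumulate() for every group, and call compute_avg_load() only in the branches that need the value. With this split, avg_load is simply left at 0 for groups that never ask for it, which is the situation the XXX comment describes.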

>
> >> We already unconditionally accumulate group_load anyway.
> >
> > accumulation must be done while looping on the group whereas computing
> > avg_load can be done only when needed
> >
> >>
> >> If it's to make way for patch 6/8 (using load instead of runnable load),
> >> then I think you are doing things in the wrong order. IMO in this patch we
> >> should unconditionally compute avg_load (using runnable load), and then
> >> you could tweak it up in a subsequent patch.
> >
> > In fact, it's not linked to patch 6/8.
> > It's only that I initially wanted to use load only when overloaded,
> > but then I got this case and thought that comparing avg_load could be
> > interesting, though without any proof that it's worth it.
> > As mentioned in the comment, tasks in this group have enough capacity
> > and there is no need to move tasks in theory. This is there mainly to
> > trigger the discussion and keep in mind a possible area of improvement
> > in a next step.
> > I haven't run tests or done more study on this particular case to make
> > sure that there would be some benefit in comparing avg_load.
> >
> > So in the future, we might end up always computing avg_load and
> > comparing it to select the busiest fully busy group.
> >
>
> Okay, that definitely wants testing then.
>
> [...]
> >>> +     if (busiest->group_type == group_misfit_task) {
> >>> +             /* Set imbalance to allow misfit task to be balanced. */
> >>> +             env->balance_type = migrate_misfit;
> >>> +             env->imbalance = busiest->group_misfit_task_load;
> >>
> >> AFAICT we don't ever use this value, other than setting it to 0 in
> >> detach_tasks(), so what we actually set it to doesn't matter (as long as
> >> it's > 0).
> >
> > Not yet.
> > It's only in patch 8/8 that we check whether the task fits the CPU's
> > capacity during detach_tasks().
> >
>
> But that doesn't use env->imbalance, right? With that v3 patch it's just
> the task's utilization, so AFAICT my comment still stands.

No, in this patch the misfit case keeps using load and imbalance, as the
current implementation does.
The changes to how misfit tasks are handled are all in patch 8.

>
> >>
> >> I'd re-suggest folding migrate_misfit into migrate_task, which is doable if
> >> we reinstate lb_env.src_grp_type (or rather, not delete it in this patch),
> >> though it makes some other places somewhat uglier. The good thing is that
> >> it actually does end up being implemented as a special kind of task
> >> migration, rather than being loosely related.
> >
> > I prefer to keep it separate instead of adding a sub case in migrate_task
> >
>
> My argument here is that ideally they shouldn't be separated, since the misfit
> migration is a subcase of task migration (or an extension of it - in any
> case, they're related). I haven't found a nicer way to express it though,
> and I agree that the special casing in detach_tasks()/fbq()/etc is meh.
>
> [...]
> >>> @@ -8765,7 +8942,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >>>       env.src_rq = busiest;
> >>>
> >>>       ld_moved = 0;
> >>> -     if (busiest->cfs.h_nr_running > 1) {
> >>> +     if (busiest->nr_running > 1) {
> >>
> >> Shouldn't that stay h_nr_running ? We can't do much if those aren't CFS
> >> tasks.
> >
> > There is the case raised by Srikar where we have, for example, 1 RT task
> > and 1 CFS task. cfs.h_nr_running==1 but we don't need active balance
> > because the CFS task is not the running task.
> >
> >>
> >>>               /*
> >>>                * Attempt to move tasks. If find_busiest_group has found
> >>>                * an imbalance but busiest->nr_running <= 1, the group is
> >>>