Message-ID: <CAKfTPtBE5fzycHk33rf7Bky-tYSeCXaKd9oGR5cgaghzbL2TvA@mail.gmail.com>
Date: Fri, 26 Jul 2019 14:30:37 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Valentin Schneider <valentin.schneider@....com>
Cc: linux-kernel <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Quentin Perret <quentin.perret@....com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Morten Rasmussen <Morten.Rasmussen@....com>,
Phil Auld <pauld@...hat.com>
Subject: Re: [PATCH 3/5] sched/fair: rework load_balance
On Fri, 26 Jul 2019 at 12:41, Valentin Schneider
<valentin.schneider@....com> wrote:
>
> On 26/07/2019 10:01, Vincent Guittot wrote:
> >> Huh, interesting. Why go for utilization?
> >
> > Mainly because that's what is used to detect a misfit task and not the load
> >
> >>
> >> Right now we store the load of the task and use it to pick the "biggest"
> >> misfit (in terms of load) when there are more than one misfit tasks to
> >> choose:
> >
> > But having a big load doesn't mean that you have a big utilization.
> >
> > So you can trigger the misfit case because of task A, which has a big
> > utilization that doesn't fit on its local CPU, but then select in
> > detach_tasks() a task B that has a small utilization and a big weight,
> > and as a result a higher load.
> > Task B will never trigger the misfit use case by itself and should not
> > steal the pulling opportunity of task A.
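To make that concrete, here is a minimal stand-alone sketch (not kernel
code: fits_capacity() is an invented stand-in for the ~20% capacity-margin
check, and the util/load numbers are made up):

#include <stdio.h>

/* Invented stand-ins for PELT utilization and weighted (h_)load. */
struct task { const char *name; unsigned long util; unsigned long load; };

/* Same idea as the kernel's ~20% margin (1280/1024). */
static int fits_capacity(unsigned long util, unsigned long cap)
{
	return util * 1280 < cap * 1024;
}

int main(void)
{
	unsigned long little_cap = 446;		/* typical LITTLE capacity */
	struct task a = { "A", 900, 1024 };	/* nice 0, CPU bound: misfit */
	struct task b = { "B", 150, 3000 };	/* nice -15, mostly sleeping */

	printf("%s fits LITTLE: %d, load %lu\n",
	       a.name, fits_capacity(a.util, little_cap), a.load);
	printf("%s fits LITTLE: %d, load %lu\n",
	       b.name, fits_capacity(b.util, little_cap), b.load);
	/*
	 * Only A is misfit, yet a pick based on load would choose B,
	 * which fits its current CPU just fine.
	 */
	return 0;
}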
> >
>
> We can avoid this entirely by going straight for an active balance when
> we are balancing misfit tasks (which we really should be doing TBH).
But your misfit task might no longer be the running one by the time
load_balance actually runs.
>
> If we *really* want to be surgical about misfit migration, we could track
> the task itself via a pointer to its task_struct, but IIRC Morten
I thought about this, but the task can have already died by that time and
the pointer is no longer relevant.
Or we would have to parse the list of tasks still attached to the CPU and
compare them against the saved pointer, but that's not scalable and
would consume a lot of time.
> purposely avoided this due to all the fun synchronization issues that
> come with it.
>
> With that out of the way, I still believe we should maximize the migrated
> load when dealing with several misfit tasks - there's not much else you can
> look at anyway to make a decision.
But you can easily select a task that is not misfit at all, so which is the
best/worst case: selecting a completely wrong task, or at least one of the
real misfit tasks?
I'm fine with going back to using load instead of util, but it's not robust IMO.
>
> It sort of makes sense: when e.g. you have two misfit tasks stuck on LITTLE
> CPUs and you finally have a big CPU being freed, it would seem fair to pick
> the one that's been "throttled" the longest - at equal niceness, that would
> be the one with the highest load.
>
> >>
> >> update_sd_pick_busiest():
> >> ,----
> >> | /*
> >> | * If we have more than one misfit sg go with the biggest misfit.
> >> | */
> >> | if (sgs->group_type == group_misfit_task &&
> >> | sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> >> | return false;
> >> `----
> >>
> >> I don't think it makes much sense to maximize utilization for misfit tasks:
> >> they're over the capacity margin, which exactly means "I can't really tell
> >> you much on that utilization other than it doesn't fit".
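(For reference, the margin check behind that statement looks roughly like
this in mainline fair.c of that era - quoting from memory, so double-check
the exact form:)

static unsigned int capacity_margin = 1280; /* ~20% margin */

static inline int task_fits_capacity(struct task_struct *p, long capacity)
{
	return capacity * 1024 > task_util_est(p) * capacity_margin;
}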
> >>
> >> At the very least, this rq field should be renamed "misfit_task_util".
> >
> > Yes, I agree that I should rename the field.
> >
> >>
> >> [...]
> >>
> >>> @@ -7060,12 +7048,21 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
> >>> enum fbq_type { regular, remote, all };
> >>>
> >>> enum group_type {
> >>> - group_other = 0,
> >>> + group_has_spare = 0,
> >>> + group_fully_busy,
> >>> group_misfit_task,
> >>> + group_asym_capacity,
> >>> group_imbalanced,
> >>> group_overloaded,
> >>> };
> >>>
> >>> +enum group_migration {
> >>> + migrate_task = 0,
> >>> + migrate_util,
> >>> + migrate_load,
> >>> + migrate_misfit,
> >>
> >> Can't we have only 3 imbalance types (task, util, load), and make misfit
> >> fall in that first one? Arguably it is a special kind of task balance,
> >> since it would go straight for the active balance, but it would fit a
> >> `migrate_task` imbalance with a "go straight for active balance" flag
> >> somewhere.
> >
> > migrate_misfit uses its own special condition to detect the task that
> > can be pulled, unlike the other types
> >
>
> Since misfit is about migrating running tasks, a `migrate_task` imbalance
> with a flag that goes straight to active balancing should work, no?
see my previous comment
>
> [...]
> >> Rather than filling the local group, shouldn't we follow the same strategy
> >> as for load, IOW try to reach an average without pushing local above nor
> >> busiest below ?
> >
> > But we don't know if this will be enough to make the busiest group no
> > longer overloaded.
> >
> > This is a transient state:
> > one group is overloaded, another one has spare capacity.
> > How to balance the system will depend on how much overload there is in
> > the group, and we don't know this value.
> > The only solution is to:
> > - try to pull as many tasks as possible to fill the spare capacity
> > - Is the group still overloaded? Use avg_load to balance the system,
> > because both groups will be overloaded.
> > - Is the group no longer overloaded? Balance the number of idle CPUs.
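Spelling that sequence out, a minimal stand-alone sketch (invented
stand-in fields and names, not the actual patch):

#include <stdio.h>

/* Invented stand-ins for the per-group statistics. */
struct group_stats {
	int overloaded;		/* more runnable tasks than the CPUs can serve */
	unsigned int idle_cpus;	/* idle CPUs in the group */
	unsigned long avg_load;	/* group load scaled by group capacity */
};

static const char *balance_step(const struct group_stats *local,
				const struct group_stats *busiest)
{
	/* Transient phase: fill the local spare capacity first. */
	if (busiest->overloaded && local->idle_cpus)
		return "pull as many tasks as possible into the spare capacity";

	/* Both groups still overloaded: fall back to avg_load. */
	if (busiest->overloaded && local->overloaded)
		return "balance on avg_load";

	/* Nobody overloaded anymore: even out the number of idle CPUs. */
	return "balance the number of idle CPUs";
}

int main(void)
{
	struct group_stats local = { 0, 2, 300 };
	struct group_stats busiest = { 1, 0, 900 };

	printf("%s\n", balance_step(&local, &busiest));
	return 0;
}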
> >
> >>
> >> We could build an sds->avg_util similar to sds->avg_load.
> >
> > When there is spare capacity, we balance the number of idle CPUs.
> >
>
> What if there is spare capacity but no idle CPUs? In scenarios like this
> we should balance utilization. We could wait for a newidle balance to
Why should we balance anything? All tasks already have enough running time.
It's better to wait for a CPU to become idle instead of trying to
predict which one will become idle first and migrating tasks uselessly,
because other tasks can easily wake up in the meantime.
> happen, but it'd be a shame to repeatedly do this when we could
> preemptively balance utilization.
>