Message-ID: <CAKfTPtBO_DYM+HK+wWMvunMzDAzHwYghy4ae7GnyQHRe1A8kZg@mail.gmail.com>
Date: Fri, 14 Mar 2025 17:24:53 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Pierre Gondois <pierre.gondois@....com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, vschneid@...hat.com, lukasz.luba@....com, 
	rafael.j.wysocki@...el.com, linux-kernel@...r.kernel.org, qyousef@...alina.io, 
	hongyan.xia2@....com, christian.loehle@....com, luis.machado@....com, 
	qperret@...gle.com
Subject: Re: [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of
 spare capacity

On Wed, 12 Mar 2025 at 15:09, Pierre Gondois <pierre.gondois@....com> wrote:
>
> Hello Vincent,
>
> On 3/2/25 22:05, Vincent Guittot wrote:
> > feec() looks for the CPU with the highest spare capacity in a PD, assuming
> > that it will be the best CPU from an energy efficiency PoV because it will
> > require the smallest increase of OPP. Although this is generally true, this
> > policy also filters out some other CPUs which would be just as efficient
> > because they use the same OPP.
> > In fact, we really care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and, as a result, using the same energy cost. In
> > these cases, we can use other metrics to select the best CPU for the same
> > energy cost.
> >
> > Rework feec() to look first for the lowest cost in a PD and then for the
> > most performant CPU among the CPUs sharing that cost. The cost of the OPP
> > remains the only comparison criterion between Performance Domains.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> > ---
> >   kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
> >   1 file changed, 246 insertions(+), 220 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d3d1a2ba6b1a..a9b97bbc085f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
>
> [...]
>
> > +static bool update_best_cpu(struct energy_cpu_stat *target,
> > +                         struct energy_cpu_stat *min,
> > +                         int prev, struct sched_domain *sd)
> >   {
> > -     unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> > -     unsigned long busy_time = eenv->pd_busy_time;
> > -     unsigned long energy;
> > -
> > -     if (dst_cpu >= 0)
> > -             busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> > +     /*  Select the one with the least number of running tasks */
> > +     if (target->nr_running < min->nr_running)
> > +             return true;
> > +     if (target->nr_running > min->nr_running)
> > +             return false;
> >
> > -     energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> > +     /* Favor previous CPU otherwise */
> > +     if (target->cpu == prev)
> > +             return true;
> > +     if (min->cpu == prev)
> > +             return false;
> >
> > -     trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
> > +     /*
> > +      * Choose CPU with lowest contention. One might want to consider load
> > +      * instead of runnable but we are supposed to not be overutilized so
> > +      * there is enough compute capacity for everybody.
> > +      */
>
> I'm not sure I understand the comment. With UCLAMP_MAX tasks, a CPU can lack
> compute capacity while not being tagged as overutilized. IIUC this is actually
> the goal of UCLAMP_MAX.

The uclamp_max says that the task doesn't need more than a compute
capacity of 1, so there is no lack of capacity.

>
> With the following workload:
> - 2 tasks A with duty_cycle=30%, UCLAMP_MIN/MAX=(0,1), niceness=0
> - 2 tasks B with duty_cycle=70%, UCLAMP_MIN/MAX=(0,1), niceness=-10

Does the duty cycle make any difference here? They won't be able to
run 30% or 70% anyway because of their uclamp_max, will they?
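
As an aside, for anyone reproducing this setup: below is a rough,
standalone userspace sketch of how a task gets such a (0,1) clamp via
sched_setattr(). Illustrative only, not part of the patch; the struct
layout and flag values follow the uapi <linux/sched/types.h> definitions
as I understand them and may need adjusting, and there is no glibc
wrapper, hence the raw syscall.

/*
 * Illustrative sketch (not part of the patch): request UCLAMP_MIN=0,
 * UCLAMP_MAX=1 for the calling task via the sched_setattr() syscall.
 */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Local mirror of the uapi struct sched_attr layout */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

#define SCHED_FLAG_KEEP_ALL		0x18	/* keep policy and params */
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_flags	= SCHED_FLAG_KEEP_ALL |
				  SCHED_FLAG_UTIL_CLAMP_MIN |
				  SCHED_FLAG_UTIL_CLAMP_MAX,
		.sched_util_min	= 0,	/* UCLAMP_MIN = 0 */
		.sched_util_max	= 1,	/* UCLAMP_MAX = 1 */
	};

	/* pid 0 means the calling task */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}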

> The workload runs on a Pixel6 with a reduced cpuset of [1,2,7], i.e. 2 little
> CPUs (1,2) capa=160 and one big CPU (7) capa=1024.
> CPU7 is avoided by the tasks as their UCLAMP_MAX setting makes them fit on the
> little CPUs.
>
> select_best_cpu() prefers to place tasks based on nr_running. If the 2 tasks A
> end up being placed on one little CPU, and the 2 tasks B are placed on the
> other little CPU, feec() is theoretically unable to balance the workload.

They will all have a compute capacity of 80, which is more than their uclamp_max = 1.

> In practice, a kworker ends up spawning on one of these 2 little CPUs and tasks
> are shuffled, so the pattern breaks after ~30ms.
>
> This pattern seems problematic as tasks A are:
> - smaller (30% < 70%)
> - nicer (0 > -10)
> than tasks B. So I assume the correct task placement should be one task of each
> type on each little CPU.
>
> ------
>
> There are some comments in the load balancer code:
> 1.
> /* Computing avg_load makes sense only when group is overloaded */
> 2.
> /*
> * Computing avg_load makes sense only when group is fully busy or
> * overloaded
> */
>
> IIUC, the load is only meaningful when there is not enough compute capacity
> to estimate the task size; otherwise util_avg makes more sense. It seems that
> when it comes to UCLAMP_MAX tasks, CPUs are placed in exactly this situation:
> load_avg makes more sense than util_avg.
> However, in this situation, energy computations also lose their meaning since
> they are based on the util_avg values.
>
> ------
>
> select_best_cpu() could check the CPU load before checking nr_running, but it
> would be meaningless if there is enough CPU time for all the tasks.
>
> Maybe CPU load should then be checked only if the system doesn't have enough
> CPU time. But this would be equivalent to:
> - removing UCLAMP_MAX from cpu_overutilized()
> - when the system is overutilized (no UCLAMP_MAX involved), falling back to
>   the load balancer
>
> In other words, I don't really see how it is possible to reconcile UCLAMP_MAX
> tasks with EAS, as EAS relies on util_avg values, and UCLAMP_MAX forces relying
> on load_avg values rather than util_avg.
>
> Regards,
> Pierre
>
>
> > +     if ((target->runnable * min->capa * sd->imbalance_pct) >=
> > +                     (min->runnable * target->capa * 100))
> > +             return false;
> >
> > -     return energy;
> > +     return true;
> >   }
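
To make the last tie-break above concrete, here is a small standalone
sketch with made-up numbers (imbalance_pct = 125 is only the common
default, not something the patch mandates): a candidate CPU only
displaces the current best one when it is less contended by more than
the imbalance_pct margin, which avoids switching CPUs for marginal
differences.

/*
 * Illustration only, not part of the patch: the runnable/capacity
 * comparison used as the last tie-break in update_best_cpu() above.
 */
#include <stdbool.h>
#include <stdio.h>

/* minimal mirror of the two fields used by the comparison */
struct cpu_stat_demo {
	unsigned long runnable;
	unsigned long capa;
};

static bool less_contended(const struct cpu_stat_demo *target,
			   const struct cpu_stat_demo *min,
			   unsigned int imbalance_pct)
{
	/*
	 * Mirrors the final check in update_best_cpu(): the candidate wins
	 * only when strictly below the imbalance_pct-scaled threshold.
	 */
	return (target->runnable * min->capa * imbalance_pct) <
	       (min->runnable * target->capa * 100);
}

int main(void)
{
	struct cpu_stat_demo min        = { .runnable = 200, .capa = 160 };
	struct cpu_stat_demo cand_close = { .runnable = 180, .capa = 160 };
	struct cpu_stat_demo cand_clear = { .runnable = 120, .capa = 160 };

	/*
	 * 180 * 160 * 125 = 3,600,000 >= 200 * 160 * 100 = 3,200,000
	 * -> not enough margin, keep the current best
	 */
	printf("slightly less runnable wins: %d\n",
	       less_contended(&cand_close, &min, 125));

	/*
	 * 120 * 160 * 125 = 2,400,000 < 3,200,000
	 * -> clearly less contended, candidate wins
	 */
	printf("clearly less runnable wins:  %d\n",
	       less_contended(&cand_clear, &min, 125));

	return 0;
}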
