Message-ID: <24bc804e-305f-4273-922a-a24070aa3e56@arm.com>
Date: Wed, 12 Mar 2025 15:08:51 +0100
From: Pierre Gondois <pierre.gondois@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>, mingo@...hat.com,
peterz@...radead.org, juri.lelli@...hat.com, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, lukasz.luba@....com, rafael.j.wysocki@...el.com,
linux-kernel@...r.kernel.org
Cc: qyousef@...alina.io, hongyan.xia2@....com, christian.loehle@....com,
luis.machado@....com, qperret@...gle.com
Subject: Re: [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of
spare capacity
Hello Vincent,
On 3/2/25 22:05, Vincent Guittot wrote:
> feec() looks for the CPU with the highest spare capacity in a PD, assuming
> it will be the best CPU from an energy efficiency PoV because it will
> require the smallest increase of OPP. Although this is generally true,
> this policy also filters out other CPUs which would be just as efficient
> because they use the same OPP.
> In fact, we really care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result using the same energy cost. In
> these cases, we can use other metrics to select the best CPU for the same
> energy cost.
>
> Rework feec() to look first for the lowest cost in a PD and then for the
> most performant CPU among those sharing that cost. The cost of the OPP
> remains the only comparison criterion between Performance Domains.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> ---
> kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
> 1 file changed, 246 insertions(+), 220 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d3d1a2ba6b1a..a9b97bbc085f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
[...]
> +static bool update_best_cpu(struct energy_cpu_stat *target,
> + struct energy_cpu_stat *min,
> + int prev, struct sched_domain *sd)
> {
> - unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> - unsigned long busy_time = eenv->pd_busy_time;
> - unsigned long energy;
> -
> - if (dst_cpu >= 0)
> - busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> + /* Select the one with the least number of running tasks */
> + if (target->nr_running < min->nr_running)
> + return true;
> + if (target->nr_running > min->nr_running)
> + return false;
>
> - energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> + /* Favor previous CPU otherwise */
> + if (target->cpu == prev)
> + return true;
> + if (min->cpu == prev)
> + return false;
>
> - trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
> + /*
> + * Choose CPU with lowest contention. One might want to consider load
> + * instead of runnable but we are supposed to not be overutilized so
> + * there is enough compute capacity for everybody.
> + */
I'm not sure I understand the comment. With UCLAMP_MAX tasks, a CPU can lack
compute capacity while not being tagged as overutilized. IIUC this is actually
the goal of UCLAMP_MAX.
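To spell out what I mean, here is a simplified kernel-style sketch of the
fitting check (a hand-written approximation of the uclamp_max handling, not
the mainline util_fits_cpu()):

	static bool cpu_fits_sketch(unsigned long util,
				    unsigned long uclamp_max,
				    unsigned long capacity)
	{
		/* The max clamp caps the value used for the check ... */
		if (util > uclamp_max)
			util = uclamp_max;

		/*
		 * ... so uclamp_max=1 tasks always "fit" and the CPU is
		 * never reported as overutilized, however busy it is.
		 * fits_capacity()-style ~20% margin.
		 */
		return util * 1280 < capacity * 1024;
	}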
With the following workload:
- 2 tasks A with duty_cycle=30%, UCLAMP_MIN/MAX=(0,1), niceness=0
- 2 tasks B with duty_cycle=70%, UCLAMP_MIN/MAX=(0,1), niceness=-10
The workload runs on a Pixel6 with a reduced cpuset of [1,2,7], i.e. 2 little
CPUs (1,2) capa=160 and one big CPU (7) capa=1024.
CPU7 is avoided by the tasks as their UCLAMP_MAX setting makes them fit on the
little CPUs.
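For reference, a minimal sketch of the per-task setup, using the
sched_setattr(2) uclamp interface (the duty-cycle loop and the cpuset
affinity are left out):

	#define _GNU_SOURCE
	#include <stdint.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	/* Not always exposed by libc headers, so redefined here */
	struct sched_attr {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		uint64_t sched_runtime, sched_deadline, sched_period;
		uint32_t sched_util_min, sched_util_max;
	};

	#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
	#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

	/* Tasks A: set_clamps_and_nice(0, 1, 0); tasks B: (0, 1, -10) */
	static int set_clamps_and_nice(uint32_t umin, uint32_t umax, int nice)
	{
		struct sched_attr attr = {
			.size		= sizeof(attr),
			.sched_policy	= 0,	/* SCHED_NORMAL */
			.sched_flags	= SCHED_FLAG_UTIL_CLAMP_MIN |
					  SCHED_FLAG_UTIL_CLAMP_MAX,
			.sched_nice	= nice,
			.sched_util_min	= umin,
			.sched_util_max	= umax,
		};

		return syscall(SYS_sched_setattr, 0, &attr, 0);
	}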
select_best_cpu() prefers to place tasks based on nr_running. If the 2 tasks A
end up being placed on one little CPU, and the 2 tasks B are placed on the
other little CPU, feec() is theoretically unable to balance the workload.
In practice, a kworker ends up spawning on one of these 2 little CPUs and tasks
are shuffled, so the pattern breaks after ~30ms.
This pattern seems problematic as tasks A are:
- smaller (30% < 70%)
- nicer (0 > -10)
than tasks B. So I assume the correct task placement should be one task of each
type on each little CPU.
------
There are some comments in the load balancer code:
1.
/* Computing avg_load makes sense only when group is overloaded */
2.
/*
* Computing avg_load makes sense only when group is fully busy or
* overloaded
*/
IIUC, load is only meaningful for estimating task size when there is not
enough compute capacity; otherwise util_avg makes more sense. It seems that
UCLAMP_MAX tasks place CPUs in this exact situation: load_avg makes more
sense than util_avg. However, in this situation the energy computations also
lose their meaning, since they are based on util_avg values.
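As a rough worked example on the two little CPUs above (per-entity PELT
signals, ignoring decay and invariance details):

	load_avg ~= runnable_share * se_weight
	util_avg ~= running_share * capacity	(so capped at capa=160 here)

	nice   0 -> se_weight 1024
	nice -10 -> se_weight 9548

On the saturated little CPU (the one with the two B tasks), util_avg gets
squashed: each 70% task only runs ~50% of the time, so its util_avg no
longer reflects its real demand. load_avg still separates A from B: with
comparable runnable time, a task B contributes roughly 9548/1024 ~= 9.3x
the load of a task A thanks to its nice weight.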
------
select_best_cpu() could check the CPU load before checking nr_running, as
sketched below, but this would be meaningless if there is enough CPU time
for all the tasks.
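For illustration only, such a check could mirror the runnable comparison of
the patch (hypothetical: it assumes a 'load' field is added to struct
energy_cpu_stat and filled with the CPU's load_avg):

	/* Hypothetical load-based tie-break in update_best_cpu() */
	if ((target->load * min->capa * sd->imbalance_pct) >=
	    (min->load * target->capa * 100))
		return false;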
Maybe CPU load should then be checked only if the system doesn't have enough
CPU time. But this would be equivalent to:
- removing UCLAMP_MAX from cpu_overutilized()
- going back to the load balancer when the system is overutilized (no
UCLAMP_MAX involved)
In other words, I don't really see how it is possible to reconcile UCLAMP_MAX
tasks with EAS: EAS relies on util_avg values, and UCLAMP_MAX forces us to
rely on load_avg values rather than util_avg.
Regards,
Pierre
> + if ((target->runnable * min->capa * sd->imbalance_pct) >=
> + (min->runnable * target->capa * 100))
> + return false;
>
> - return energy;
> + return true;
> }