lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <b23444d5-0482-4075-8cad-cae7e687d996@arm.com>
Date: Tue, 24 Dec 2024 16:46:53 +0000
From: Luis Machado <luis.machado@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, dietmar.eggemann@....com,
 rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
 vschneid@...hat.com, lukasz.luba@....com, rafael.j.wysocki@...el.com,
 linux-kernel@...r.kernel.org
Cc: qyousef@...alina.io, hongyan.xia2@....com, pierre.gondois@....com,
 christian.loehle@....com, qperret@...gle.com
Subject: Re: [PATCH 3/7 v2] sched/fair: Rework feec() to use cost instead of
 spare capacity

Hi,

Just spotted a few things while going through the series to get a feel
for what it is trying to accomplish. Feel free to take or ignore the
suggestions, as they're mostly cosmetic anyway.

On 12/17/24 16:07, Vincent Guittot wrote:
> feec() looks for the CPU with highest spare capacity in a PD assuming that
> it will be the best CPU from a energy efficiency PoV because it will
> require the smallest increase of OPP. Although this is true generally
> speaking, this policy also filters some others CPUs which will be as
> efficients because of using the same OPP.
> In fact, we really care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result using the same energy cost. In
> these cases, we can use other metrics to select the best CPU for the same
> energy cost.
> 
> Rework feec() to look 1st for the lowest cost in a PD and then the most
> performant CPU between CPUs. The cost of the OPP remains the only
> comparison criteria between Performance Domains.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> ---
>  kernel/sched/fair.c | 463 +++++++++++++++++++++++---------------------
>  1 file changed, 241 insertions(+), 222 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 893eb6844642..cd046e8216a9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8228,29 +8228,37 @@ unsigned long sched_cpu_util(int cpu)
>  }
>  
>  /*
> - * energy_env - Utilization landscape for energy estimation.
> - * @task_busy_time: Utilization contribution by the task for which we test the
> - *                  placement. Given by eenv_task_busy_time().
> - * @pd_busy_time:   Utilization of the whole perf domain without the task
> - *                  contribution. Given by eenv_pd_busy_time().
> - * @cpu_cap:        Maximum CPU capacity for the perf domain.
> - * @pd_cap:         Entire perf domain capacity. (pd->nr_cpus * cpu_cap).
> - */
> -struct energy_env {
> -	unsigned long task_busy_time;
> -	unsigned long pd_busy_time;
> -	unsigned long cpu_cap;
> -	unsigned long pd_cap;
> + * energy_cpu_stat - Utilization landscape for energy estimation.
> + * @idx :        Index of the OPP in the performance domain
> + * @cost :       Cost of the OPP
> + * @max_perf :   Compute capacity of OPP
> + * @min_perf :   Compute capacity of the previous OPP
> + * @capa :       Capacity of the CPU
> + * @runnable :   runnbale_avg of the CPU

Typo: runnbale/runnable

> + * @nr_running : number of cfs running task
> + * @fits :       Fits level of the CPU
> + * @cpu :        current best CPU
> + */
> +struct energy_cpu_stat {
> +	unsigned long idx;
> +	unsigned long cost;
> +	unsigned long max_perf;
> +	unsigned long min_perf;
> +	unsigned long capa;
> +	unsigned long util;
> +	unsigned long runnable;
> +	unsigned int nr_running;
> +	int fits;
> +	int cpu;
>  };
>  
>  /*
> - * Compute the task busy time for compute_energy(). This time cannot be
> - * injected directly into effective_cpu_util() because of the IRQ scaling.
> + * Compute the task busy time for computing its energy impact. This time cannot
> + * be injected directly into effective_cpu_util() because of the IRQ scaling.
>   * The latter only makes sense with the most recent CPUs where the task has
>   * run.
>   */
> -static inline void eenv_task_busy_time(struct energy_env *eenv,
> -				       struct task_struct *p, int prev_cpu)
> +static inline unsigned long task_busy_time(struct task_struct *p, int prev_cpu)
>  {
>  	unsigned long busy_time, max_cap = arch_scale_cpu_capacity(prev_cpu);
>  	unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
> @@ -8260,124 +8268,150 @@ static inline void eenv_task_busy_time(struct energy_env *eenv,
>  	else
>  		busy_time = scale_irq_capacity(task_util_est(p), irq, max_cap);
>  
> -	eenv->task_busy_time = busy_time;
> +	return busy_time;
>  }
>  
> -/*
> - * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
> - * utilization for each @pd_cpus, it however doesn't take into account
> - * clamping since the ratio (utilization / cpu_capacity) is already enough to
> - * scale the EM reported power consumption at the (eventually clamped)
> - * cpu_capacity.
> - *
> - * The contribution of the task @p for which we want to estimate the
> - * energy cost is removed (by cpu_util()) and must be calculated
> - * separately (see eenv_task_busy_time). This ensures:
> - *
> - *   - A stable PD utilization, no matter which CPU of that PD we want to place
> - *     the task on.
> - *
> - *   - A fair comparison between CPUs as the task contribution (task_util())
> - *     will always be the same no matter which CPU utilization we rely on
> - *     (util_avg or util_est).
> - *
> - * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't
> - * exceed @eenv->pd_cap.
> - */
> -static inline void eenv_pd_busy_time(struct energy_env *eenv,
> -				     struct cpumask *pd_cpus,
> -				     struct task_struct *p)
> +/* Estimate the utilization of the CPU that is then used to select the OPP */
> +static unsigned long find_cpu_max_util(int cpu, struct task_struct *p, int dst_cpu)
>  {
> -	unsigned long busy_time = 0;
> -	int cpu;
> +	unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
> +	unsigned long eff_util, min, max;
> +
> +	/*
> +	 * Performance domain frequency: utilization clamping
> +	 * must be considered since it affects the selection
> +	 * of the performance domain frequency.
> +	 */
> +	eff_util = effective_cpu_util(cpu, util, &min, &max);
>  
> -	for_each_cpu(cpu, pd_cpus) {
> -		unsigned long util = cpu_util(cpu, p, -1, 0);
> +	/* Task's uclamp can modify min and max value */
> +	if (uclamp_is_used() && cpu == dst_cpu) {
> +		min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
>  
> -		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> +		/*
> +		 * If there is no active max uclamp constraint,
> +		 * directly use task's one, otherwise keep max.
> +		 */
> +		if (uclamp_rq_is_idle(cpu_rq(cpu)))
> +			max = uclamp_eff_value(p, UCLAMP_MAX);
> +		else
> +			max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
>  	}
>  
> -	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
> +	eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
> +	return eff_util;
>  }
>  
> -/*
> - * Compute the maximum utilization for compute_energy() when the task @p
> - * is placed on the cpu @dst_cpu.
> - *
> - * Returns the maximum utilization among @eenv->cpus. This utilization can't
> - * exceed @eenv->cpu_cap.
> - */
> -static inline unsigned long
> -eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
> -		 struct task_struct *p, int dst_cpu)
> +/* Estimate the utilization of the CPU without the task */
> +static unsigned long find_cpu_actual_util(int cpu, struct task_struct *p)
>  {
> -	unsigned long max_util = 0;
> -	int cpu;
> +	unsigned long util = cpu_util(cpu, p, -1, 0);
> +	unsigned long eff_util;
>  
> -	for_each_cpu(cpu, pd_cpus) {
> -		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
> -		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
> -		unsigned long eff_util, min, max;
> +	eff_util = effective_cpu_util(cpu, util, NULL, NULL);
>  
> -		/*
> -		 * Performance domain frequency: utilization clamping
> -		 * must be considered since it affects the selection
> -		 * of the performance domain frequency.
> -		 * NOTE: in case RT tasks are running, by default the min
> -		 * utilization can be max OPP.
> -		 */
> -		eff_util = effective_cpu_util(cpu, util, &min, &max);
> +	return eff_util;
> +}
>  
> -		/* Task's uclamp can modify min and max value */
> -		if (tsk && uclamp_is_used()) {
> -			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
> +/* Find the cost of a performance domain for the estimated utilization */
> +static inline void find_pd_cost(struct em_perf_domain *pd,
> +				unsigned long max_util,
> +				struct energy_cpu_stat *stat)
> +{
> +	struct em_perf_table *em_table;
> +	struct em_perf_state *ps;
> +	int i;
>  
> -			/*
> -			 * If there is no active max uclamp constraint,
> -			 * directly use task's one, otherwise keep max.
> -			 */
> -			if (uclamp_rq_is_idle(cpu_rq(cpu)))
> -				max = uclamp_eff_value(p, UCLAMP_MAX);
> -			else
> -				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
> -		}
> +	/*
> +	 * Find the lowest performance state of the Energy Model above the
> +	 * requested performance.
> +	 */
> +	em_table = rcu_dereference(pd->em_table);
> +	i = em_pd_get_efficient_state(em_table->state, pd, max_util);
> +	ps = &em_table->state[i];
>  
> -		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
> -		max_util = max(max_util, eff_util);
> +	/* Save the cost and performance range of the OPP */
> +	stat->max_perf = ps->performance;
> +	stat->cost = ps->cost;
> +	i = em_pd_get_previous_state(em_table->state, pd, i);
> +	if (i < 0)
> +		stat->min_perf = 0;
> +	else {
> +		ps = &em_table->state[i];
> +		stat->min_perf = ps->performance;
>  	}
> -
> -	return min(max_util, eenv->cpu_cap);
>  }
>  
> -/*
> - * compute_energy(): Use the Energy Model to estimate the energy that @pd would
> - * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
> - * contribution is ignored.
> - */
> -static inline unsigned long
> -compute_energy(struct energy_env *eenv, struct perf_domain *pd,
> -	       struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
> +/*Check if the CPU can handle the waking task */
> +static int check_cpu_with_task(struct task_struct *p, int cpu)
>  {
> -	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> -	unsigned long busy_time = eenv->pd_busy_time;
> -	unsigned long energy;
> +	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
> +	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> +	unsigned long util_min = p_util_min;
> +	unsigned long util_max = p_util_max;
> +	unsigned long util = cpu_util(cpu, p, cpu, 0);
> +	struct rq *rq = cpu_rq(cpu);
>  
> -	if (dst_cpu >= 0)
> -		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> +	/*
> +	 * Skip CPUs that cannot satisfy the capacity request.
> +	 * IOW, placing the task there would make the CPU
> +	 * overutilized. Take uclamp into account to see how
> +	 * much capacity we can get out of the CPU; this is
> +	 * aligned with sched_cpu_util().
> +	 */
> +	if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
> +		unsigned long rq_util_min, rq_util_max;
> +		/*
> +		 * Open code uclamp_rq_util_with() except for
> +		 * the clamp() part. I.e.: apply max aggregation
> +		 * only. util_fits_cpu() logic requires to
> +		 * operate on non clamped util but must use the
> +		 * max-aggregated uclamp_{min, max}.
> +		 */
> +		rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
> +		rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
> +		util_min = max(rq_util_min, p_util_min);
> +		util_max = max(rq_util_max, p_util_max);
> +	}
> +	return util_fits_cpu(util, util_min, util_max, cpu);
> +}
> +
> +/* For a same cost, select the CPU that will povide best performance for the task */

s/For a same cost/For the same cost/
s/povide/provide/

> +static bool select_best_cpu(struct energy_cpu_stat *target,
> +			    struct energy_cpu_stat *min,
> +			    int prev, struct sched_domain *sd)
> +{
> +	/*  Select the one with the least number of running tasks */
> +	if (target->nr_running < min->nr_running)
> +		return true;
> +	if (target->nr_running > min->nr_running)
> +		return false;

Reading through the above, it seems obvious what to do for the cases where we're below
and above min->nr_running. Then the only case for which we want to do additional checks is
when target->nr_running == min->nr_running. Would you mind adding a comment to that effect,
explaining why we need further processing for that case? I feel it will help clarify things.

>  
> -	energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
> +	/* Favor previous CPU otherwise */
> +	if (target->cpu == prev)
> +		return true;
> +	if (min->cpu == prev)
> +		return false;
>  
> -	trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
> +	/*
> +	 * Choose CPU with lowest contention. One might want to consider load instead of
> +	 * runnable but we are supposed to not be overutilized so there is enough compute
> +	 * capacity for everybody.
> +	 */
> +	if ((target->runnable * min->capa * sd->imbalance_pct) >=
> +			(min->runnable * target->capa * 100))
> +		return false;
>  
> -	return energy;
> +	return true;
>  }

Might be just me, but the name of the function, select_best_cpu, seems to call for a
different kind of answer than true/false. It seems to me it is trying to determine whether
target->cpu is the best cpu for the job at hand. Should we adjust the function name to
something a bit more appropriate? Maybe is_best_cpu?

>  
>  /*
>   * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
> - * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
> - * spare capacity in each performance domain and uses it as a potential
> - * candidate to execute the task. Then, it uses the Energy Model to figure
> - * out which of the CPU candidates is the most energy-efficient.
> + * waking task. find_energy_efficient_cpu() looks for the CPU with the lowest
> + * power cost (usually with maximum spare capacity but not always) in each
> + * performance domain and uses it as a potential candidate to execute the task.
> + * Then, it uses the Energy Model to figure out which of the CPU candidates is
> + * the most energy-efficient.
>   *
>   * The rationale for this heuristic is as follows. In a performance domain,
>   * all the most energy efficient CPU candidates (according to the Energy
> @@ -8414,17 +8448,14 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
>  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  {
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> -	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
> -	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
> -	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
>  	struct root_domain *rd = this_rq()->rd;
> -	int cpu, best_energy_cpu, target = -1;
> -	int prev_fits = -1, best_fits = -1;
> -	unsigned long best_actual_cap = 0;
> -	unsigned long prev_actual_cap = 0;
> +	unsigned long best_nrg = ULONG_MAX;
> +	unsigned long task_util;
>  	struct sched_domain *sd;
>  	struct perf_domain *pd;
> -	struct energy_env eenv;
> +	int cpu, target = -1;
> +	int best_fits = -1;
> +	int best_cpu = -1;
>  
>  	rcu_read_lock();
>  	pd = rcu_dereference(rd->pd);
> @@ -8444,19 +8475,19 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  	target = prev_cpu;
>  
>  	sync_entity_load_avg(&p->se);
> -	if (!task_util_est(p) && p_util_min == 0)
> -		goto unlock;
> -
> -	eenv_task_busy_time(&eenv, p, prev_cpu);
> +	task_util = task_busy_time(p, prev_cpu);
>  
>  	for (; pd; pd = pd->next) {
> -		unsigned long util_min = p_util_min, util_max = p_util_max;
> -		unsigned long cpu_cap, cpu_actual_cap, util;
> -		long prev_spare_cap = -1, max_spare_cap = -1;
> -		unsigned long rq_util_min, rq_util_max;
> -		unsigned long cur_delta, base_energy;
> -		int max_spare_cap_cpu = -1;
> -		int fits, max_fits = -1;
> +		unsigned long pd_actual_util = 0, delta_nrg = 0;
> +		unsigned long cpu_actual_cap, max_cost = 0;
> +		struct energy_cpu_stat target_stat;
> +		struct energy_cpu_stat min_stat = {
> +			.cost = ULONG_MAX,
> +			.max_perf = ULONG_MAX,
> +			.min_perf = ULONG_MAX,
> +			.fits = -2,
> +			.cpu = -1,
> +		};
>  
>  		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
>  
> @@ -8467,13 +8498,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  		cpu = cpumask_first(cpus);
>  		cpu_actual_cap = get_actual_cpu_capacity(cpu);
>  
> -		eenv.cpu_cap = cpu_actual_cap;
> -		eenv.pd_cap = 0;
> -
> +		/* In a PD, the CPU with the lowest cost will be the most efficient */
>  		for_each_cpu(cpu, cpus) {
> -			struct rq *rq = cpu_rq(cpu);
> -
> -			eenv.pd_cap += cpu_actual_cap;
> +			unsigned long target_perf;
>  
>  			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
>  				continue;
> @@ -8481,120 +8508,112 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>  				continue;
>  
> -			util = cpu_util(cpu, p, cpu, 0);
> -			cpu_cap = capacity_of(cpu);
> +			target_stat.fits = check_cpu_with_task(p, cpu);
> +
> +			if (!target_stat.fits)
> +				continue;
> +
> +			/* 1st select the CPU that fits best */
> +			if (target_stat.fits < min_stat.fits)
> +				continue;
> +
> +			/* Then select the CPU with lowest cost */
> +
> +			/* Get the performance of the CPU w/ waking task. */
> +			target_perf = find_cpu_max_util(cpu, p, cpu);
> +			target_perf = min(target_perf, cpu_actual_cap);
> +
> +			/* Needing a higher OPP means a higher cost */
> +			if (target_perf > min_stat.max_perf)
> +				continue;
>  
>  			/*
> -			 * Skip CPUs that cannot satisfy the capacity request.
> -			 * IOW, placing the task there would make the CPU
> -			 * overutilized. Take uclamp into account to see how
> -			 * much capacity we can get out of the CPU; this is
> -			 * aligned with sched_cpu_util().
> +			 * At this point, target's cost can be either equal or
> +			 * lower than the current minimum cost.
>  			 */
> -			if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
> -				/*
> -				 * Open code uclamp_rq_util_with() except for
> -				 * the clamp() part. I.e.: apply max aggregation
> -				 * only. util_fits_cpu() logic requires to
> -				 * operate on non clamped util but must use the
> -				 * max-aggregated uclamp_{min, max}.
> -				 */
> -				rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
> -				rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
>  
> -				util_min = max(rq_util_min, p_util_min);
> -				util_max = max(rq_util_max, p_util_max);
> -			}
> +			/* Gather more statistics */
> +			target_stat.cpu = cpu;
> +			target_stat.runnable = cpu_runnable(cpu_rq(cpu));
> +			target_stat.capa = capacity_of(cpu);
> +			target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_runnable;
>  
> -			fits = util_fits_cpu(util, util_min, util_max, cpu);
> -			if (!fits)
> +			/* If the target needs a lower OPP, then look up for
> +			 * the corresponding OPP and its associated cost.
> +			 * Otherwise at same cost level, select the CPU which
> +			 * provides best performance.
> +			 */
> +			if (target_perf < min_stat.min_perf)
> +				find_pd_cost(pd->em_pd, target_perf, &target_stat);
> +			else if (!select_best_cpu(&target_stat, &min_stat, prev_cpu, sd))
>  				continue;
>  
> -			lsub_positive(&cpu_cap, util);
> -
> -			if (cpu == prev_cpu) {
> -				/* Always use prev_cpu as a candidate. */
> -				prev_spare_cap = cpu_cap;
> -				prev_fits = fits;
> -			} else if ((fits > max_fits) ||
> -				   ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
> -				/*
> -				 * Find the CPU with the maximum spare capacity
> -				 * among the remaining CPUs in the performance
> -				 * domain.
> -				 */
> -				max_spare_cap = cpu_cap;
> -				max_spare_cap_cpu = cpu;
> -				max_fits = fits;
> -			}
> +			/* Save the new most efficient CPU of the PD */
> +			min_stat = target_stat;
>  		}
>  
> -		if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> +		if (min_stat.cpu == -1)
>  			continue;
>  
> -		eenv_pd_busy_time(&eenv, cpus, p);
> -		/* Compute the 'base' energy of the pd, without @p */
> -		base_energy = compute_energy(&eenv, pd, cpus, p, -1);
> +		if (min_stat.fits < best_fits)
> +			continue;
>  
> -		/* Evaluate the energy impact of using prev_cpu. */
> -		if (prev_spare_cap > -1) {
> -			prev_delta = compute_energy(&eenv, pd, cpus, p,
> -						    prev_cpu);
> -			/* CPU utilization has changed */
> -			if (prev_delta < base_energy)
> -				goto unlock;
> -			prev_delta -= base_energy;
> -			prev_actual_cap = cpu_actual_cap;
> -			best_delta = min(best_delta, prev_delta);
> -		}
> +		/* Idle system costs nothing */
> +		target_stat.max_perf = 0;
> +		target_stat.cost = 0;
>  
> -		/* Evaluate the energy impact of using max_spare_cap_cpu. */
> -		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> -			/* Current best energy cpu fits better */
> -			if (max_fits < best_fits)
> -				continue;
> +		/* Estimate utilization and cost without p */
> +		for_each_cpu(cpu, cpus) {
> +			unsigned long target_util;
>  
> -			/*
> -			 * Both don't fit performance hint (i.e. uclamp_min)
> -			 * but best energy cpu has better capacity.
> -			 */
> -			if ((max_fits < 0) &&
> -			    (cpu_actual_cap <= best_actual_cap))
> -				continue;
> +			/* Accumulate actual utilization w/o task p */
> +			pd_actual_util += find_cpu_actual_util(cpu, p);
>  
> -			cur_delta = compute_energy(&eenv, pd, cpus, p,
> -						   max_spare_cap_cpu);
> -			/* CPU utilization has changed */
> -			if (cur_delta < base_energy)
> -				goto unlock;
> -			cur_delta -= base_energy;
> +			/* Get the max utilization of the CPU w/o task p */
> +			target_util = find_cpu_max_util(cpu, p, -1);
> +			target_util = min(target_util, cpu_actual_cap);
>  
> -			/*
> -			 * Both fit for the task but best energy cpu has lower
> -			 * energy impact.
> -			 */
> -			if ((max_fits > 0) && (best_fits > 0) &&
> -			    (cur_delta >= best_delta))
> +			/* Current OPP is enough */
> +			if (target_util <= target_stat.max_perf)
>  				continue;
>  
> -			best_delta = cur_delta;
> -			best_energy_cpu = max_spare_cap_cpu;
> -			best_fits = max_fits;
> -			best_actual_cap = cpu_actual_cap;
> +			/* Compute and save the cost of the OPP */
> +			find_pd_cost(pd->em_pd, target_util, &target_stat);
> +			max_cost = target_stat.cost;
>  		}
> -	}
> -	rcu_read_unlock();
>  
> -	if ((best_fits > prev_fits) ||
> -	    ((best_fits > 0) && (best_delta < prev_delta)) ||
> -	    ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
> -		target = best_energy_cpu;
> +		/* Add the NRG cost of p */
> +		delta_nrg = task_util * min_stat.cost;
>  
> -	return target;
> +		/* Compute the NRG cost of others running at higher OPP because of p */
> +		if (min_stat.cost > max_cost)
> +			delta_nrg += pd_actual_util * (min_stat.cost - max_cost);
> +
> +		/* nrg with p */
> +		trace_sched_compute_energy_tp(p, min_stat.cpu, delta_nrg,
> +				min_stat.max_perf, pd_actual_util + task_util);
> +
> +		/*
> +		 * The probability that delta NRGs are equals is almost null. PDs being sorted
> +		 * by max capacity, keep the one with highest max capacity if this
> +		 * happens.
> +		 * TODO: add a margin in nrg cost and take into account other stats
> +		 */
> +		if ((min_stat.fits == best_fits) &&
> +		    (delta_nrg >= best_nrg))
> +			continue;
> +
> +		best_fits = min_stat.fits;
> +		best_nrg = delta_nrg;

General comment. Maybe using nrg to mean energy is a known abbreviation in the area. If so,
feel free to ignore this. But it took me a bit to make the connection between the two, whereas
best_energy/delta_energy would've been a bit clearer.

> +		best_cpu = min_stat.cpu;
> +	}
>  
>  unlock:
>  	rcu_read_unlock();
>  
> +	if (best_cpu >= 0)
> +		target = best_cpu;
> +
>  	return target;
>  }
>  


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ