linux-kernel - Re: [PATCH 3/4] sched/fair: Remove task_util from effective utilization in feec()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d0cdf2cf-d098-627f-3a43-7d36f73a4c8f@arm.com>
Date:   Tue, 4 Jan 2022 18:27:34 +0100
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Vincent Donnefort <vincent.donnefort@....com>,
        peterz@...radead.org, mingo@...hat.com, vincent.guittot@...aro.org
Cc:     linux-kernel@...r.kernel.org, valentin.schneider@....com,
        morten.rasmussen@....com, chris.redpath@....com,
        qperret@...gle.com, lukasz.luba@....com
Subject: Re: [PATCH 3/4] sched/fair: Remove task_util from effective
 utilization in feec()

On 09/12/2021 17:11, Vincent Donnefort wrote:
> The energy estimation in find_energy_efficient_cpu() (feec()) relies on
> the computation of the effective utilization for each CPU of a perf domain
> (PD). This effective utilization is then used as an estimation of the busy
> time for this pd. The function effective_cpu_util() which gives this value,
> scales the utilization relative to IRQ pressure on the CPU to take into
> account that the IRQ time is hidden from the task clock. The IRQ scaling is
> as follow:
> 
>    effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util

effective_cpu_util stands for cpu_util (effective_cpu_util(...,
ENERGY_UTIL, ...),  right? (1)

> Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
> the CPU and irq the IRQ avg time.
> 
> If now we take as an example a task placement which doesn't raise the OPP
> on the candidate CPU, we can write the energy delta as:
> 
>   delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
>                              effective_cpu_util(cpu_util))
>         = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util
> 
> We end-up with an energy delta depending on the IRQ avg time, which is a
> problem: first the time spent on IRQs by a CPU has no effect on the
> additional energy that would be consumed by a task. Second, we don't want
> to favour a CPU with a higher IRQ avg time value.
> 
> Nonetheless, we need to take the IRQ avg time into account. If a task
> placement raises the PD's frequency, it will increase the energy cost for
> the entire time where the CPU is busy. A solution is to only use
> effective_cpu_util() with the CPU contribution part. The task contribution
> is added separately and scaled according to prev_cpu's IRQ time.

This whole idea looks like a follow-up of commit 0372e1cf70c2
("sched/fair: Fix task utilization accountability in compute_energy()").

I forgot why we still use cpu_util_next(cpu, p, dst_cpu) for
FREQUENCY_UTIL though?

> No change for the FREQUENCY_UTIL component of the energy estimation. We

OK, it indirectly says so. (1)

[...]

> @@ -6599,23 +6599,83 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
>  	return min(util, capacity_orig_of(cpu));
>  }
>  
> +/*
> + * Compute the task busy time for compute_energy(). This time cannot be
> + * injected directly into effective_cpu_util() because of the IRQ scaling.
> + * The latter only makes sense with the most recent CPUs where the task has
> + * run.
> + */
> +static inline unsigned long
> +task_busy_time(struct task_struct *p, int prev_cpu)
> +{
> +	unsigned long cpu_cap = arch_scale_cpu_capacity(prev_cpu);

s/cpu_cap/max_cap ... to stay consistent

> +	unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));

What about irq >= max_cap ?

effective_cpu_util() has the following condition:

if (unlikely(irq >= max_cap))
    return max_cap;

[...]

> + * The contribution of the task @p for which we want to estimate the energy
> + * cost is removed (by cpu_util_next()) and must be compted separalyte (see

s/compted separalyte/calculated separately ?

[...]

> +static inline unsigned long
> +pd_task_busy_time(struct task_struct *p, struct perf_domain *pd,
> +		  unsigned long cpu_cap, unsigned long tsk_busy_time,
> +		  unsigned long *pd_tsk_busy_time)
> +{
> +	unsigned long max_cap, pd_cap = 0, pd_busy_time = 0;
> +	struct cpumask *pd_mask = perf_domain_span(pd);
> +	int cpu;
> +
> +	max_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));

In case we use 'sched, drivers: Remove max param from
effective_cpu_util()/sched_cpu_util()'

https://gitlab.arm.com/linux-arm/linux-de/-/commit/150e753e861285e82e9d7c593f1f26075c34e124

we would not have to have max_cap for effective_cpu_util(). This would
make the code easier to understand since we already have to pass pd_cap
and cpu_cap (cpu_thermal_cap) now.

> +
> +	/* see compute_energy() */
> +	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {

Somehow unrelated ... We now use cpu_online_mask 3 times per PD to
iterate over the CPUs. cpu_online_mask can change during the run-queue
selection.

We could reuse `select_idle_mask` to create one cpumask at the beginning
of feec() and pass it down the fucntions:

See `sched/fair: Rename select_idle_mask to select_rq_mask` and
`sched/fair: Use the same cpumask per-PD throughout
find_energy_efficient_cpu()`

https://gitlab.arm.com/linux-arm/linux-de/-/commit/ec5bc27a298dd1352dbaff5809743128fa351075

https://gitlab.arm.com/linux-arm/linux-de/-/commit/f73b19968a65b07b0ad5bd1dff721ed1a675a24b

> +		unsigned long util = cpu_util_next(cpu, p, -1);
> +
> +		pd_cap += cpu_cap;
> +		pd_busy_time += effective_cpu_util(cpu, util, max_cap,
> +						     ENERGY_UTIL, NULL);
> +	}
> +
> +	pd_busy_time = min(pd_cap, pd_busy_time);
> +	*pd_tsk_busy_time = min(pd_cap, pd_busy_time + tsk_busy_time);

We do `sum_util += min(cpu_util, _cpu_cap (cpu_thermal_cap))` in
compute_energy() so far but now you sum up PD capacity (pd_cap) first
and then compare the sum_util (i.e. pd_busy_time) with pd_cap. Why?

cpu_util = effective_cpu_util(..., FREQUENCY_UTIL, ...) still uses
min(cpu_util, _cpu_cap).

> +
> +	return pd_busy_time;

This function seems to be a little bit overloaded. It's called
pd_task_busy_time but returns `pd` busy time and `pd + task` busy time.

In case we could calculate pd_cap pd_task_busy_time() then this function
can only return pd_busy_time and pd_tsk_busy_time could also calculate
outside the function.

See `XXX: Calculate pd_cap & pd_tsk_busy_time outside pd_task_busy_time()`

https://gitlab.arm.com/linux-arm/linux-de/-/commit/4512d681fe32eb5e2862c4c5d5a03b1d84129e26

[...]

> @@ -6628,34 +6688,11 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  	 */
>  	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
>  		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> -		unsigned long cpu_util, util_running = util_freq;
> +		unsigned long cpu_util;
>  		struct task_struct *tsk = NULL;
>  
> -		/*
> -		 * When @p is placed on @cpu:
> -		 *
> -		 * util_running = max(cpu_util, cpu_util_est) +
> -		 *		  max(task_util, _task_util_est)
> -		 *
> -		 * while cpu_util_next is: max(cpu_util + task_util,
> -		 *			       cpu_util_est + _task_util_est)
> -		 */
> -		if (cpu == dst_cpu) {
> +		if (cpu == dst_cpu)
>  			tsk = p;
> -			util_running =
> -				cpu_util_next(cpu, p, -1) + task_util_est(p);
> -		}
> -
> -		/*
> -		 * Busy time computation: utilization clamping is not
> -		 * required since the ratio (sum_util / cpu_capacity)
> -		 * is already enough to scale the EM reported power
> -		 * consumption at the (eventually clamped) cpu_capacity.
> -		 */
> -		cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
> -					      ENERGY_UTIL, NULL);
> -
> -		sum_util += min(cpu_util, _cpu_cap);
>  
>  		/*
>  		 * Performance domain frequency: utilization clamping
> @@ -6664,12 +6701,12 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  		 * NOTE: in case RT tasks are running, by default the
>  		 * FREQUENCY_UTIL's utilization can be max OPP.
>  		 */
> -		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
> +		cpu_util = effective_cpu_util(cpu, util_freq, max_cap,
>  					      FREQUENCY_UTIL, tsk);
> -		max_util = max(max_util, min(cpu_util, _cpu_cap));
> +		max_util = max(max_util, min(cpu_util, cpu_cap));
>  	}

It's hard to understand since it is unbalanced that `busy time` is
calculated in pd_busy_time() whereas `max_util` is still calculated in
compute_energy().

IMHO it would be much easier to understand if there would be a
pd_max_util() as well so that we could do:

  busy_time = get_pd_busy_time()
  max_util = get_pd_max_util(..., -1, ...)
  base_energy = compute_energy(..., max_util, busy_time, ...)

  busy_time = min(busy_time + tsk_busy_time, pd_cap);
  if (compute_prev_delta)
      max_util = get_pd_max_util(..., prev_cpu, ...)
      prev_delta = compute_energy(..., max_util, busy_time, ...)
  ...

See `XXX: Split get_pd_max_util() from compute_energy()`

https://gitlab.arm.com/linux-arm/linux-de/-/commit/6d79929ee7c8675a127158187786dd1d6b6dd355

[...]

> @@ -6783,13 +6824,27 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  		if (max_spare_cap_cpu < 0 && !compute_prev_delta)
>  			continue;
>  
> +		/* Account thermal pressure for the energy estimation */
> +		cpu_thermal_cap = arch_scale_cpu_capacity(
> +					cpumask_first(perf_domain_span(pd)));
> +		cpu_thermal_cap -= arch_scale_thermal_pressure(
> +					cpumask_first(perf_domain_span(pd)));

Yes, we should calculate cpu_thermal_cap only once per PD. Can you make
this a little bit more easy to read by getting `cpu =
cpumask_first(perf_domain_span(pd));` first?

[...]