Message-ID: <63c3c6c3-cb16-4096-a067-ced25e916380@arm.com>
Date: Wed, 11 Sep 2024 18:51:50 +0200
From: Pierre Gondois <pierre.gondois@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, dietmar.eggemann@....com,
 rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
 vschneid@...hat.com, lukasz.luba@....com, rafael.j.wysocki@...el.com,
 linux-kernel@...r.kernel.org
Cc: qyousef@...alina.io, hongyan.xia2@....com
Subject: Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of
 spare capacity

(edit)

On 9/11/24 16:02, Pierre Gondois wrote:
> Hello Vincent,
> 
> On 8/30/24 15:03, Vincent Guittot wrote:
>> feec() looks for the CPU with the highest spare capacity in a PD, assuming
>> that it will be the best CPU from an energy efficiency PoV because it will
>> require the smallest increase of OPP. Although this is generally true,
>> this policy also filters out some other CPUs which will be just as
>> efficient because they use the same OPP.
>> In fact, we really care about the cost of the new OPP that will be
>> selected to handle the waking task. In many cases, several CPUs will end
>> up selecting the same OPP and as a result using the same energy cost. In
>> these cases, we can use other metrics to select the best CPU for the same
>> energy cost.
>>
>> Rework feec() to look first for the lowest cost in a PD and then for the
>> most performant CPU among the candidates.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
>> ---
>>    kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
>>    1 file changed, 244 insertions(+), 222 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e67d6029b269..2273eecf6086 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
> 
> [snip]
> 
>>    
>> -/*
>> - * compute_energy(): Use the Energy Model to estimate the energy that @pd would
>> - * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
>> - * contribution is ignored.
>> - */
>> -static inline unsigned long
>> -compute_energy(struct energy_env *eenv, struct perf_domain *pd,
>> -	       struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
>> +/* Check if the CPU can handle the waking task */
>> +static int check_cpu_with_task(struct task_struct *p, int cpu)
>>    {
>> -	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
>> -	unsigned long busy_time = eenv->pd_busy_time;
>> -	unsigned long energy;
>> +	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
>> +	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
>> +	unsigned long util_min = p_util_min;
>> +	unsigned long util_max = p_util_max;
>> +	unsigned long util = cpu_util(cpu, p, cpu, 0);
>> +	struct rq *rq = cpu_rq(cpu);
>>    
>> -	if (dst_cpu >= 0)
>> -		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
> 
> I think you should mention that the energy computation is not capped anymore.
> It used to be:
> pd_util: sum of the CPUs' util for the pd considered, without task_util
> pd_cap: sum of the CPUs' capacities for the pd
> 
> (a)
> busy_time = min(pd_cap, pd_util);
> prev_energy = busy_time * OPP[prev_max_util].cost;
> 
> busy_time = min(pd_cap, pd_util + task_util);
> new_energy = busy_time * OPP[new_max_util].cost;
> 
> delta_energy = new_energy - prev_energy;
> 
> Note that when computing busy_time, task_util is not capped to one CPU's
> max_cap. This means that in an empty pd, a task can 'steal' capacity from
> CPUs during the energy computation.
> Cf. [1]
> 
> and it is now:
> (b)
> delta_energy = task_util * OPP[new_max_util].cost;
> delta_energy += pd_util * (OPP[new_max_util].cost - OPP[prev_max_util].cost);
> 
> Note that the busy_time (task_util + pd_util) is now not capped by anything.
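> 
> To make the difference concrete, here is a minimal, self-contained C
> sketch of the two computations above. em_cost() and all the numbers
> are made up for illustration; this is not the kernel's EM API:
> 
> #include <stdio.h>
> 
> /* Hypothetical per-OPP cost: a higher max_util selects a costlier OPP. */
> static unsigned long em_cost(unsigned long max_util)
> {
> 	return max_util <= 256 ? 10 : max_util <= 512 ? 25 : 60;
> }
> 
> static unsigned long min_ul(unsigned long a, unsigned long b)
> {
> 	return a < b ? a : b;
> }
> 
> int main(void)
> {
> 	unsigned long pd_cap = 2048, pd_util = 1800, task_util = 600;
> 	unsigned long prev_max_util = 450, new_max_util = 600;
> 
> 	/* (a) old scheme: busy_time is capped to pd_cap */
> 	unsigned long busy = min_ul(pd_cap, pd_util);
> 	unsigned long prev_e = busy * em_cost(prev_max_util);
> 
> 	busy = min_ul(pd_cap, pd_util + task_util); /* 2400 capped to 2048 */
> 	unsigned long new_e = busy * em_cost(new_max_util);
> 	printf("(a) capped delta_energy:   %lu\n", new_e - prev_e);
> 
> 	/* (b) new scheme: (pd_util + task_util) is not capped anymore */
> 	unsigned long delta = task_util * em_cost(new_max_util)
> 		+ pd_util * (em_cost(new_max_util) - em_cost(prev_max_util));
> 	printf("(b) uncapped delta_energy: %lu\n", delta);
> 
> 	return 0;
> }
> 
> With these numbers, (a) gives 77880 while (b) gives 99000: the two
> schemes agree as long as pd_util + task_util <= pd_cap, and diverge
> as soon as the cap would have kicked in.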
> 
> ---
> 
> Not capping the task_util is discussed in [1][3] and [2] (at 21:15).
> IIUC, UCLAMP_MAX tasks are the only case leveraging this. Indeed,
> non-clamped tasks will not fit and will be placed on a bigger CPU. The
> same goes for UCLAMP_MIN tasks.
> FWIU, not capping the utilization of tasks during the energy computation
> makes it possible to reflect that a task will (likely) run longer.
> However, the task's performance would likely decrease, as the other
> tasks on the target CPU are not taken into account (the task to be
> placed is assumed to receive the compute power it requires).
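> 
> A small worked example of this point (numbers illustrative): take a
> little CPU with max_cap = 512 in an otherwise empty pd, and a task
> with util = 1000 at an OPP cost of 10:
> 
> uncapped: energy = 1000 * 10 = 10000
> capped:   energy = min(512, 1000) * 10 = 5120
> 
> i.e. the uncapped version roughly models the task running about twice
> as long, but it says nothing about the slowdown caused by other
> runnable tasks on that CPU.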
> 
> ---
> Case A:
> 
> Assuming we have an empty system with:
> - 4 little CPUs (max_capa=512, first OPP at [capa=256, cost=10])
> - 2 big CPUs (max_capa=1024, first OPP at [capa=512, cost=10])
> i.e. the first OPP of all the CPUs consumes the same amount of energy.
> And a task with: [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
> 
> Then feec() would have no reason to prefer a big CPU over a little CPU,
> even though the big CPU would provide more performance.
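> 
> Putting numbers on it with the delta formula (b) above: the pd is
> empty, so the pd_util term vanishes, and UCLAMP_MAX=10 keeps both pd
> kinds at their first OPP:
> 
> little: delta_energy = 1000 * 10 = 10000
> big:    delta_energy = 1000 * 10 = 10000
> 
> feec() sees a tie, even though the big CPU's first OPP provides
> capa=512 instead of capa=256.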
> 
> ---
> Case B:
> 
> (This is not especially related to this patch.)
> Another case that might be problematic:
> - 4 little CPUs (max_capa=512, first OPP at [capa=256])
> - 2 big CPUs (max_capa=1024, first OPP at [capa=512])
> - little CPUs consume less than big CPUs, but the highest OPP
>     of the little CPUs consumes more than the lowest OPP of the big CPUs.
> And tasks:
> - 3 tasks with [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
> - 1 task with [UCLAMP_MIN=0, UCLAMP_MAX=1024, util = 50]
> 
> Then
> - the 3 UCLAMP_MAX tasks will be placed on the little CPUs. Indeed,
>     due to the last patch of the series, these tasks now have an opportunity
>     to run feec() and be placed on a more energy efficient CPU.
> - the 'normal' task will be placed on a big CPU. Indeed, placing
>     it on a little CPU would raise the OPP of the little cluster.
> 
> This means that the 'normal' task is prevented from running on the remaining
> little CPU even though:
> - the little CPU can provide the compute capacity

This behaviour is actually due to the little CPUs not being able to provide
the compute capacity for the normal task without raising the OPP of the cluster.
So this behaviour is expected.

I am providing another case instead:
Case B':
- 4 little CPUs (max_capa=512, first OPP at [capa=256])
- 2 big CPUs (max_capa=1024, first OPP at [capa=512])
And tasks:
- 4 tasks with [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
- 1 task with [UCLAMP_MIN=0, UCLAMP_MAX=1024, util = 50]

The normal task will not fit on any of the little CPUs, as the rq's UCLAMP_MAX
value would rise from 10 to 1024. If I'm not mistaken (this time), the
normal task should be placed on a little CPU as:
- it consumes less power
- even though UCLAMP_MAX tasks consume the least power they can, they make
   other tasks consume more

Theoretically, placing the 4 UCLAMP_MAX tasks on one little CPU and using
another CPU for the normal task would:
- consume less energy
- satisfy the UCLAMP_MAX constraint
even though the workload's performance would be lower. This is a bit
hard for me to conceive.
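
For reference, here is a simplified C sketch of the max-aggregation rule
that produces the rq-level UCLAMP_MAX jump described above (my
understanding of the rq clamp accounting, not the actual kernel code):

#include <stdio.h>

/* Under max aggregation, the rq's UCLAMP_MAX is the maximum over the
 * clamps of the enqueued tasks.
 */
static unsigned long rq_uclamp_max(const unsigned long *umax, int nr)
{
	unsigned long m = 0;

	for (int i = 0; i < nr; i++)
		if (umax[i] > m)
			m = umax[i];
	return m;
}

int main(void)
{
	unsigned long before[] = { 10 };	/* one UCLAMP_MAX=10 task */
	unsigned long after[]  = { 10, 1024 };	/* + the normal task */

	printf("rq UCLAMP_MAX before: %lu\n", rq_uclamp_max(before, 1));
	printf("rq UCLAMP_MAX after:  %lu\n", rq_uclamp_max(after, 2));
	return 0;
}

Enqueueing the normal task on any little rq raises the rq clamp from 10
to 1024, which is why the normal task does not fit there.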


> - the little CPU would consume less energy
> 
> In other words, using UCLAMP_MAX on some tasks makes the system consume
> more energy.
> 
> ---
> 
> In my opinion, this last case comes from the difficulty of defining UCLAMP_MAX.
> From sched-util-clamp.rst (about UCLAMP_MAX):
> - Uclamp is a hinting mechanism that allows the scheduler to understand the
>     performance requirements and restrictions of the tasks
> - The right way to view util clamp is as a mechanism to make request or hint on
>     performance constraints.
> - some tasks should be restricted from consuming too
>     much resources and should not go above a specific performance point.
> -
> Another example is in Android where tasks are classified as background,
> foreground, top-app, etc. Util clamp can be used to constrain how much
> resources background tasks are consuming by capping the performance point they
> can run at. This constraint helps reserve resources for important tasks, like
> the ones belonging to the currently active app (top-app group). Beside this
> helps in limiting how much power they consume. This can be more obvious in
> heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the
> background tasks to stay on the little cores which will ensure that:
> 
>           1. The big cores are free to run top-app tasks immediately. top-app
>              tasks are the tasks the user is currently interacting with, hence
>              the most important tasks in the system.
>           2. They don't run on a power hungry core and drain battery even if they
>              are CPU intensive tasks.
> -
> For example, it can be handy to limit performance when running low on battery
> or when the system wants to limit access to more energy hungry performance
> levels when it's in idle state or screen is off.
> 
> """
> This constraint helps reserve resources for important tasks, like
> the ones belonging to the currently active app (top-app group).
> """
> It doesn't seem that UCLAMP_MAX does this. This looks more like bandwidth
> control.
> 
> """
> They don't run on a power hungry core and drain battery even if they
> are CPU intensive tasks.
> """
> Avoiding mid/big CPUs could be done with tasksets.
> 
> I can understand that one might want to run a task at a higher OPP, due to
> timing constraints for instance. However, I don't see why someone would want
> to run a task at a lower OPP when considering only performance and not
> energy consumption. It thus means that UCLAMP_MAX is an energy hint rather
> than a performance hint.
> 
> UCLAMP_MAX could be set for a task to make it spend less energy, but the
> loss in performance would be unknown.
> A case Hongyan mentioned in his uclamp sum aggregation series [4] is that
> an infinite task with [UCLAMP_MIN=0, UCLAMP_MAX=1, util = 1000] could fit
> in a little CPU. The energy consumption would indeed be very low, but the
> performance would also be very low.
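> 
> In numbers (illustrative): for frequency selection under max
> aggregation, the util seen by the governor is roughly clamped by the
> task's UCLAMP_MAX:
> 
> clamped_util = min(util, UCLAMP_MAX) = min(1000, 1) = 1
> 
> so the little CPU stays at its lowest OPP although the task is, in
> practice, always runnable.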
> 
> With Hongyan's sum aggregation series [5]:
> - case B would not happen as the 'normal' task would not raise the OPP of
>     the whole cluster.

Cf. above

> - the 'infinite UCLAMP_MAX tasks' case would not happen as each task would
>     account for 1 util
> - case A would still happen, but could be solved in any case.
> 
> I know Hongyan's patchset has already been discussed, but I still don't
> understand why max aggregation is preferred over sum aggregation.
> The definition of UCLAMP_MAX values seems clearer and in effect results in
> a simpler implementation and fewer corner cases. In simple words:
> "When estimating the CPU frequency to set, for this task,
> account for at most X util."
> rather than:
> "When estimating the CPU frequency to set, the task with the highest
> UCLAMP_MAX of the CPU will cap the requested frequency."
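> 
> To make the comparison concrete, here is a toy C sketch of the two
> policies as I understand them (simplified; neither function is the
> actual implementation of either series):
> 
> #include <stdio.h>
> 
> static unsigned long min_ul(unsigned long a, unsigned long b)
> {
> 	return a < b ? a : b;
> }
> 
> /* max aggregation: the highest UCLAMP_MAX on the rq caps the summed util */
> static unsigned long freq_util_max_aggr(const unsigned long util[],
> 					 const unsigned long umax[], int nr)
> {
> 	unsigned long sum = 0, cap = 0;
> 
> 	for (int i = 0; i < nr; i++) {
> 		sum += util[i];
> 		if (umax[i] > cap)
> 			cap = umax[i];
> 	}
> 	return min_ul(sum, cap);
> }
> 
> /* sum aggregation: each task contributes at most its own UCLAMP_MAX */
> static unsigned long freq_util_sum_aggr(const unsigned long util[],
> 					 const unsigned long umax[], int nr)
> {
> 	unsigned long sum = 0;
> 
> 	for (int i = 0; i < nr; i++)
> 		sum += min_ul(util[i], umax[i]);
> 	return sum;
> }
> 
> int main(void)
> {
> 	/* one UCLAMP_MAX=10 task (util=1000) + one normal task (util=50) */
> 	unsigned long util[] = { 1000, 50 };
> 	unsigned long umax[] = { 10, 1024 };
> 
> 	printf("max aggregation: %lu\n", freq_util_max_aggr(util, umax, 2));
> 	printf("sum aggregation: %lu\n", freq_util_sum_aggr(util, umax, 2));
> 	return 0;
> }
> 
> Here max aggregation requests util 1024 (the clamped task's full 1000
> leaks into the request as soon as the normal task arrives), while sum
> aggregation requests 10 + 50 = 60.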
> 
> Note that I actually think it's a good idea to run feec() regularly
> and to take into account other parameters like nr_running. I just think
> that UCLAMP_MAX's max aggregation creates corner cases that are difficult
> to solve altogether.
> 
> Thanks in advance for the time you will take to answer,
> Regards,
> Pierre
> 
> [1] https://lore.kernel.org/all/20240606070645.3295-1-xuewen.yan@unisoc.com/
> [2] https://www.youtube.com/watch?v=PHEBAyxeM_M
> [3] https://lore.kernel.org/all/CAKfTPtDPCPYvCi1c_Nh+Cn01ZVS7E=tAHQeNX-mArBt3BXdjYw@mail.gmail.com/
> [4] https://lore.kernel.org/all/b81a5b1c-14de-4232-bee9-ee647355dd8c@arm.com/
> [5] https://lore.kernel.org/all/cover.1706792708.git.hongyan.xia2@arm.com/#t
> 
