[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <681fb3e8-d645-2558-38de-b39b372499de@arm.com>
Date:   Thu, 16 Jul 2020 15:24:37 +0100
From:   Lukasz Luba <lukasz.luba@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Viresh Kumar <viresh.kumar@...aro.org>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Zhang Rui <rui.zhang@...el.com>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        Amit Daniel Kachhap <amit.kachhap@...il.com>,
        Javi Merino <javi.merino@...nel.org>,
        Amit Kucheria <amit.kucheria@...durent.com>,
        linux-kernel@...r.kernel.org, Quentin Perret <qperret@...gle.com>,
        Rafael Wysocki <rjw@...ysocki.net>, linux-pm@...r.kernel.org
Subject: Re: [PATCH 2/2] thermal: cpufreq_cooling: Reuse effective_cpu_util()
Hi Peter,
Thank you for summarizing this. I've put my comments below.
On 7/16/20 12:56 PM, Peter Zijlstra wrote:
> On Tue, Jul 14, 2020 at 12:06:53PM +0530, Viresh Kumar wrote:
>>   /**
>> + * get_load() - get current load for a cpu
>>    * @cpufreq_cdev:	&struct cpufreq_cooling_device for this cpu
>>    * @cpu:	cpu number
>> + * @cpu_idx:	index of the cpu
>>    *
>> + * Return: The current load of cpu @cpu in percentage.
>>    */
>>   static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>>   		    int cpu_idx)
>>   {
>> +	unsigned long util = cpu_util_cfs(cpu_rq(cpu));
>> +	unsigned long max = arch_scale_cpu_capacity(cpu);
>>   
>> +	util = effective_cpu_util(cpu, util, max, ENERGY_UTIL, NULL);
>> +	return (util * 100) / max;
>>   }
> 
> So there's a number of things... let me recap a bunch of things that
> got mentioned on IRC earlier this week and then continue from there..
> 
> So IPA* (or any other thermal governor) needs energy estimates for the
> various managed devices, cpufreq_cooling, being the driver for the CPU
> device, needs to provide that and in return receives feedback on how
> much energy it is allowed to consume, cpufreq_cooling then dynamically
> enables/disables OPP states.
Currently, only IPA uses the power estimation, other governors don't
use these API functions in cpufreq_cooling.
> 
> There are actually two methods the thermal governor will use:
> get_real_power() and get_requested_power().
> 
> The first isn't used anywhere in mainline, but could be implemented on
> hardware that has energy counters (like say x86 RAPL).
The first is only present as callback for registered devfreq cooling,
which is registered by devfreq driver. If that driver provides the
get_real_power(), it will be called from get_requested_power().
Thus, it's likely that IPA would get real power value from HW.
I was planning to add it also to cpufreq_cooling callbacks years
ago...
> 
> The second attempts to guesstimate power, and is the subject of this
> patch.
> 
> Currently cpufreq_cooling appears to estimate the CPU energy usage by
> calculating the percentage of idle time using the per-cpu cpustat stuff,
> which is pretty horrific.
Even worse, it then *samples* the *current* CPU frequency at that
particular point in time and assumes that when the CPU wasn't idle
during that period - it had *this* frequency...
> 
> This patch then attempts to improve upon that by using the scheduler's
> cpu_util(ENERGY_UTIL) estimate, which is also used to select OPP state
> and improves upon avg idle. This should be a big improvement as higher
IMHO this patch set doesn't address the real problem: 'sampling
freq problem' described above. There was no issue with getting idle
period. The avg freq was the problem, in that period when the
CPUs were running. The model implemented in alg was also a problem.
The whole period (e.g. CPU freqs which were used or idle state)
^(CPU freq)
|
|                            sampling the current freq
|                _______        |
|               |      |        |
|________       |      |        |
|       |       |      |        |
|       | idle  |      |________v________...
|_ _____|_______|__________________________> (time)
   start of period               end
   |<------- (typically 100ms)-->|
> frequency consumes more energy, but should we not also consider that:
> 
> 	E = C V^2 f
> 
> The EAS energy model has tables for the OPPs that contain this, but in
> this case we seem to be assuming a linear enery/frequency curve, which
> is just not the case.
I am not sure if I got your point. To understand your point better
I think some drawing would be required. I will skip this patch
and old mainline code and focus on your proposed solution
(because this patch set does not address 'sampling freq problem').
> 
> I suppose we could do something like **:
> 
> 	100 * util^3 / max^3
> 
> which assumes V~f.
In EM we keep power values in the array and these values grow
exponentially. Each OPP has it corresponding
P_x = C (V_x)^2 f_x    , where x is the OPP id thus corresponding V,f
so we have discrete power values, growing like:
^(power)
|
|
|                          *
|
|
|                       *
|                       |
|                   *   |
|                       | <----- power estimation function
|            *          |        should not use linear 'util/max_util'
|   *                   |        relation here *
|_______________________|_____________> (freq)
    opp0     opp1  opp2 opp3 opp4
What is the problem
First:
We need to pick the right Power from the array. I would suggest
to pick the max allowed frequency for that whole period, because
we don't know if the CPUs were using it (it's likely).
Second:
Then we have the utilization, which can be considered as:
'idle period & running period with various freq inside', lets
call it avg performance in that whole period.
Third:
Try to estimate the power used in that whole period having
the avg performance and max performance.
What you are suggesting is to travel that [*] line in
non-linear fashion, but in (util^3)/(max_util^3). Which means
it goes down faster when the utilization drops.
I think it is too aggressive, e.g.
500^3 / 1024^3 = 0.116  <--- very little, ~12%
200^3 / 300^3  = 0.296
Peter could you confirm if I understood you correct?
This is quite important bit for me.
> 
> Another point is that cpu_util() vs turbo is a bit iffy, and to that,
> things like x86-APERF/MPERF and ARM-AMU got mentioned. Those might also
> have the benefit of giving you values that match your own sampling
> interval (100ms), where the sched stuff is PELT (64,32.. based).
> 
> So what I've been thinking is that cpufreq drivers ought to be able to
> supply this method, and only when they lack, can the cpufreq-governor
> (schedutil) install a fallback. And then cpufreq-cooling can use
> whatever is provided (through the cpufreq interfaces).
> 
> That way, we:
> 
>   1) don't have to export anything
>   2) get arch drivers to provide something 'better'
> 
> 
> Does that sounds like something sensible?
> 
Yes, make sense. Please also keep in mind that this
utilization somehow must be mapped into power in a proper way.
I am currently working on addressing all of these problems
(including this correlation).
Thank you for your time spending on it and your suggestions.
Regards,
Lukasz
> 
> 
> 
> [*] I always want a beer when I see that name :-)
> 
> [**] I despise code that uses percentages, computers suck at
> /100 and there is no reason not to use any other random fraction, so why
> pick a bad one.
> 
Powered by blists - more mailing lists
 
