Message-ID: <bcad0cfe-c96c-a933-6784-325f67d34c62@arm.com>
Date: Fri, 17 Jul 2020 10:55:36 +0100
From: Lukasz Luba <lukasz.luba@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Viresh Kumar <viresh.kumar@...aro.org>,
Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Zhang Rui <rui.zhang@...el.com>,
Daniel Lezcano <daniel.lezcano@...aro.org>,
Amit Daniel Kachhap <amit.kachhap@...il.com>,
Javi Merino <javi.merino@...nel.org>,
Amit Kucheria <amit.kucheria@...durent.com>,
linux-kernel@...r.kernel.org, Quentin Perret <qperret@...gle.com>,
Rafael Wysocki <rjw@...ysocki.net>, linux-pm@...r.kernel.org
Subject: Re: [PATCH 2/2] thermal: cpufreq_cooling: Reuse effective_cpu_util()
On 7/16/20 4:43 PM, Peter Zijlstra wrote:
> On Thu, Jul 16, 2020 at 03:24:37PM +0100, Lukasz Luba wrote:
>> On 7/16/20 12:56 PM, Peter Zijlstra wrote:
>
>>> The second attempts to guesstimate power, and is the subject of this
>>> patch.
>>>
>>> Currently cpufreq_cooling appears to estimate the CPU energy usage by
>>> calculating the percentage of idle time using the per-cpu cpustat stuff,
>>> which is pretty horrific.
>>
>> Even worse, it then *samples* the *current* CPU frequency at that
>> particular point in time and assumes that when the CPU wasn't idle
>> during that period - it had *this* frequency...
>
> *whee* :-)
>
> ...
>
>> In EM we keep power values in the array and these values grow
>> exponentially. Each OPP has its corresponding
>>
>> P_x = C (V_x)^2 f_x , where x is the OPP id thus corresponding V,f
>>
>> so we have discrete power values, growing like:
>>
>>   ^(power)
>>   |
>>   |
>>   |                       *
>>   |
>>   |
>>   |                 *
>>   |                       |
>>   |            *          |
>>   |                       | <----- power estimation function
>>   |       *               |        should not use linear 'util/max_util'
>>   |  *                    |        relation here *
>>   |_______________________|_____________> (freq)
>>     opp0 opp1 opp2 opp3 opp4
>>
>> What is the problem
>> First:
>> We need to pick the right Power from the array. I would suggest
>> to pick the max allowed frequency for that whole period, because
>> we don't know if the CPUs were using it (it's likely).
>> Second:
>> Then we have the utilization, which can be considered as:
>> 'idle period & running period with various freq inside', let's
>> call it avg performance in that whole period.
>> Third:
>> Try to estimate the power used in that whole period having
>> the avg performance and max performance.
>>
>> What you are suggesting is to travel that [*] line in
>> non-linear fashion, but in (util^3)/(max_util^3). Which means
>> it goes down faster when the utilization drops.
>> I think it is too aggressive, e.g.
>> 500^3 / 1024^3 = 0.116 <--- very little, ~12%
>> 200^3 / 300^3 = 0.296
>>
>> Peter, could you confirm whether I understood you correctly?
>
> Correct, with the caveat that we might try and regression fit a 3rd
> order polynomial to a bunch of EM data to see if there's a 'better'
> function to be had than a raw 'f(x) := x^3'.
I agree, I think we are on the same wavelength now.
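
To put numbers on the comparison above, here is a minimal sketch. The OPP
table (frequencies and voltages) and the constant C are invented
illustration values, not data from any real platform, and the helper names
opp_power, scale_linear and scale_cubic are hypothetical.

```python
# Hypothetical OPP table: (frequency in MHz, voltage in V). Values are invented.
OPPS = [(500, 0.70), (800, 0.80), (1100, 0.90), (1400, 1.00), (1700, 1.10)]
C = 1.0  # assumed capacitance-like constant, arbitrary units


def opp_power(freq, volt, c=C):
    """P_x = C * V_x^2 * f_x for a single OPP."""
    return c * volt ** 2 * freq


def scale_linear(util, max_util):
    """Fraction of the max-OPP power under the linear util/max_util model."""
    return util / max_util


def scale_cubic(util, max_util):
    """Fraction of the max-OPP power under the raw f(x) := x^3 model."""
    return (util / max_util) ** 3


# Discrete power values, one per OPP, growing faster than frequency.
powers = [opp_power(f, v) for f, v in OPPS]

# The two utilization pairs used in the discussion above.
for util, max_util in [(500, 1024), (200, 300)]:
    print(util, max_util,
          round(scale_linear(util, max_util), 3),
          round(scale_cubic(util, max_util), 3))
```

For 500/1024 the linear model keeps about 49% of the max-OPP power while
the cubic one keeps about 12%, which is the gap being discussed.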
>
>> This is quite an important bit for me.
>
> So, if we assume schedutil + EM, we can actually have schedutil
> calculate a running power sum. That is, something like: \Int P_x dt.
> Because we know the points where OPP changes.
Yes, that's why I was thinking about storing a copy of this information
inside the EM, so that other subsystems such as thermal, powercap, etc.
can simply read it.
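
A rough sketch of such a running sum, assuming the power is constant
between OPP changes and that we are notified at every change. The class
RunningEnergy and its methods are hypothetical names for illustration only,
not an existing EM or schedutil interface.

```python
class RunningEnergy:
    """Accumulates the integral of P_x dt across OPP changes (hypothetical helper)."""

    def __init__(self, start_time, power):
        self.last_time = start_time   # time of the last update, seconds
        self.power = power            # power of the current OPP, watts
        self.energy = 0.0             # joules accumulated so far

    def _advance(self, now):
        # Power is constant between OPP changes, so each segment is P * dt.
        self.energy += self.power * (now - self.last_time)
        self.last_time = now

    def on_opp_change(self, now, new_power):
        """Call at each OPP switch with the power of the new OPP."""
        self._advance(now)
        self.power = new_power

    def read(self, now):
        """Return the energy up to 'now'; a consumer such as thermal would call this."""
        self._advance(now)
        return self.energy


acc = RunningEnergy(start_time=0.0, power=1.0)
acc.on_opp_change(now=2.0, new_power=3.0)   # 1 W for 2 s
acc.on_opp_change(now=3.0, new_power=0.5)   # 3 W for 1 s
total = acc.read(now=5.0)                   # 0.5 W for 2 s
print(total)
```

A reader that samples this value at the start and end of its own window
gets the average power as the energy difference divided by the window
length. Idle time is ignored here.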
>
> Although, thinking more, I suspect we need tighter integration with
> cpuidle, because we don't actually have idle times here, but that should
> be doable.
I have been scratching my head for a while over that idle issue. It opens
up more dimensions to tackle.
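
One simplistic way to picture it, assuming we had both the OPP segments and
the idle intervals for the same period: charge the OPP power only for the
busy part of each segment. The function active_energy, the idle_power
parameter and the sample numbers are all hypothetical, and the idle
intervals are assumed not to overlap each other.

```python
def active_energy(opp_segments, idle_intervals, idle_power=0.0):
    """Energy over OPP segments, charging idle_power during idle overlap.

    opp_segments:   list of (start, end, power), constant power per segment
    idle_intervals: list of (start, end) during which the CPU was idle;
                    assumed non-overlapping
    """
    energy = 0.0
    for seg_start, seg_end, power in opp_segments:
        idle_time = 0.0
        for idle_start, idle_end in idle_intervals:
            # Length of the overlap between this OPP segment and the idle interval.
            overlap = min(seg_end, idle_end) - max(seg_start, idle_start)
            if overlap > 0:
                idle_time += overlap
        busy_time = (seg_end - seg_start) - idle_time
        energy += power * busy_time + idle_power * idle_time
    return energy


segments = [(0.0, 2.0, 1.0), (2.0, 3.0, 3.0), (3.0, 5.0, 0.5)]
idle = [(1.0, 2.5), (4.0, 5.0)]
print(active_energy(segments, idle))
```

This treats every idle state as the same constant power, which is exactly
the kind of simplification that a tighter cpuidle integration would have to
replace.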
>
> But for anything other than schedutil + EM, things become more
> interesting, because then we need to guesstimate power usage without the
> benefit of having actual power numbers.
Yes, from the engineering/research perspective, platforms which do not
have EM in Linux (like Intel) are also interesting.
>
> We can of course still do that running power sum, with whatever P(u) or
> P(f) end up with, I suppose.
>
>>> Another point is that cpu_util() vs turbo is a bit iffy, and to that,
>>> things like x86-APERF/MPERF and ARM-AMU got mentioned. Those might also
>>> have the benefit of giving you values that match your own sampling
>>> interval (100ms), where the sched stuff is PELT (64,32.. based).
>>>
>>> So what I've been thinking is that cpufreq drivers ought to be able to
>>> supply this method, and only when they lack, can the cpufreq-governor
>>> (schedutil) install a fallback. And then cpufreq-cooling can use
>>> whatever is provided (through the cpufreq interfaces).
>>>
>>> That way, we:
>>>
>>> 1) don't have to export anything
>>> 2) get arch drivers to provide something 'better'
>>>
>>>
>>> Does that sound like something sensible?
>>>
>>
>> Yes, that makes sense. Please also keep in mind that this
>> utilization must somehow be mapped to power in a proper way.
>> I am currently working on addressing all of these problems
>> (including this correlation).
>
> Right, so that mapping util to power was what I was missing and
> suggesting we do. So for 'simple' hardware we have cpufreq events for
> frequency change, and cpuidle events for idle, and with EM we can simply
> sum the relevant power numbers.
>
> For hardware lacking EM, or hardware managed DVFS, we'll have to fudge
> things a little. How best to do that is up in the air a little, but
> virtual power curves seem a useful tool to me.
>
> The next problem for IPA is having all the devices report power in the
> same virtual unit I suppose, but I'll leave that to others ;-)
>
True, there are more issues. There is also another effort around powercap,
driven by Daniel Lezcano, which I am going to support. He may also be
interested in having a copy of the calculated energy stored in the EM.
I must gather some requirements and align with him.
Thank you for your support!
Regards,
Lukasz