linux-kernel - Re: [PATCH 2/2] thermal: cpufreq_cooling: Reuse effective_cpu

Message-ID: <bcad0cfe-c96c-a933-6784-325f67d34c62@arm.com>
Date:   Fri, 17 Jul 2020 10:55:36 +0100
From:   Lukasz Luba <lukasz.luba@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Viresh Kumar <viresh.kumar@...aro.org>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Zhang Rui <rui.zhang@...el.com>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        Amit Daniel Kachhap <amit.kachhap@...il.com>,
        Javi Merino <javi.merino@...nel.org>,
        Amit Kucheria <amit.kucheria@...durent.com>,
        linux-kernel@...r.kernel.org, Quentin Perret <qperret@...gle.com>,
        Rafael Wysocki <rjw@...ysocki.net>, linux-pm@...r.kernel.org
Subject: Re: [PATCH 2/2] thermal: cpufreq_cooling: Reuse effective_cpu_util()



On 7/16/20 4:43 PM, Peter Zijlstra wrote:
> On Thu, Jul 16, 2020 at 03:24:37PM +0100, Lukasz Luba wrote:
>> On 7/16/20 12:56 PM, Peter Zijlstra wrote:
> 
>>> The second attempts to guesstimate power, and is the subject of this
>>> patch.
>>>
>>> Currently cpufreq_cooling appears to estimate the CPU energy usage by
>>> calculating the percentage of idle time using the per-cpu cpustat stuff,
>>> which is pretty horrific.
>>
>> Even worse, it then *samples* the *current* CPU frequency at that
>> particular point in time and assumes that when the CPU wasn't idle
>> during that period - it had *this* frequency...
> 
> *whee* :-)
> 
> ...
> 
>> In EM we keep power values in the array and these values grow
>> exponentially. Each OPP has its corresponding
>>
>> P_x = C (V_x)^2 f_x    , where x is the OPP id thus corresponding V,f
>>
>> so we have discrete power values, growing like:
>>
>> ^(power)
>> |
>> |
>> |                          *
>> |
>> |
>> |                       *
>> |                       |
>> |                   *   |
>> |                       | <----- power estimation function
>> |            *          |        should not use linear 'util/max_util'
>> |   *                   |        relation here *
>> |_______________________|_____________> (freq)
>>     opp0     opp1  opp2 opp3 opp4
>>
>> What is the problem
>> First:
>> We need to pick the right Power from the array. I would suggest
>> to pick the max allowed frequency for that whole period, because
>> we don't know if the CPUs were using it (it's likely).
>> Second:
>> Then we have the utilization, which can be considered as:
>> 'idle period & running period with various freq inside', let's
>> call it avg performance in that whole period.
>> Third:
>> Try to estimate the power used in that whole period having
>> the avg performance and max performance.
>>
>> What you are suggesting is to travel that [*] line in
>> non-linear fashion, but in (util^3)/(max_util^3). Which means
>> it goes down faster when the utilization drops.
>> I think it is too aggressive, e.g.
>> 500^3 / 1024^3 = 0.116  <--- very little, ~12%
>> 200^3 / 300^3  = 0.296
>>
>> Peter, could you confirm if I understood you correctly?
> 
> Correct, with the caveat that we might try and regression fit a 3rd
> order polynomial to a bunch of EM data to see if there's a 'better'
> function to be had than a raw 'f(x) := x^3'.

I agree, I think we are on the same wavelength now.
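
To make the comparison above concrete, here is a minimal sketch (not kernel code) that evaluates the linear 'util/max_util' scaling next to the raw cubic one for the two example points quoted earlier. The function names are made up for illustration; a regression-fitted polynomial would replace cubic_scale().

```python
def linear_scale(util, max_util):
    """Fraction of the chosen OPP power under a linear util/max_util model."""
    return util / max_util


def cubic_scale(util, max_util):
    """Fraction of the chosen OPP power under the raw f(x) := x^3 model."""
    return (util / max_util) ** 3


# The two example points from the discussion above.
for util, max_util in [(500, 1024), (200, 300)]:
    lin = linear_scale(util, max_util)
    cub = cubic_scale(util, max_util)
    print(f"util={util}/{max_util}: linear={lin:.3f} cubic={cub:.3f}")
```

The cubic model drops to roughly 12% of the OPP power at about half utilization, where the linear model still reports about 49%.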

> 
>> This is quite an important bit for me.
> 
> So, if we assume schedutil + EM, we can actually have schedutil
> calculate a running power sum. That is, something like: \Int P_x dt.
> Because we know the points where OPP changes.

Yes, that's why I was thinking about having this information stored as a
copy inside the EM, then just read it in other subsystem like: thermal,
powercap, etc.
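
As a rough sketch of the running power sum (\Int P_x dt) discussed above, assuming each OPP change is available as a (timestamp, power) pair taken from the EM: the event list, the units and the function name are hypothetical, and idle periods are deliberately not modelled here since they would need input from cpuidle.

```python
def running_energy(events, end_time_ms):
    """Sum P_x * dt over piecewise-constant power segments.

    events: list of (timestamp_ms, power_mw) sorted by time; each entry
            marks the moment the CPU switched to an OPP with that power.
    end_time_ms: end of the accounting window.
    Returns energy in mW*ms, which is numerically equal to microjoules.
    """
    energy = 0
    # Pair every event with the start of the next one; the last segment
    # runs until the end of the window.
    boundaries = [t for t, _ in events[1:]] + [end_time_ms]
    for (start, power), stop in zip(events, boundaries):
        energy += power * (stop - start)
    return energy


# Hypothetical OPP changes inside one 100 ms sampling window.
opp_changes = [(0, 100), (20, 400), (70, 100)]
print(running_energy(opp_changes, 100))
```

Another subsystem (thermal, powercap) would then only need to read the accumulated value at its own sampling interval and take the difference between two reads.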

> 
> Although, thinking more, I suspect we need tighter integration with
> cpuidle, because we don't actually have idle times here, but that should
> be doable.

I have been scratching my head for a while because of that idle issue. It
opens more dimensions to tackle.

> 
> But for anything other than schedutil + EM, things become more
> interesting, because then we need to guesstimate power usage without the
> benefit of having actual power numbers.

Yes, from the engineering/research perspective, platforms which do not
have EM in Linux (like Intel) are also interesting.

> 
> We can of course still do that running power sum, with whatever P(u) or
> P(f) end up with, I suppose.
> 
>>> Another point is that cpu_util() vs turbo is a bit iffy, and to that,
>>> things like x86-APERF/MPERF and ARM-AMU got mentioned. Those might also
>>> have the benefit of giving you values that match your own sampling
>>> interval (100ms), where the sched stuff is PELT (64,32.. based).
>>>
>>> So what I've been thinking is that cpufreq drivers ought to be able to
>>> supply this method, and only when they lack, can the cpufreq-governor
>>> (schedutil) install a fallback. And then cpufreq-cooling can use
>>> whatever is provided (through the cpufreq interfaces).
>>>
>>> That way, we:
>>>
>>>    1) don't have to export anything
>>>    2) get arch drivers to provide something 'better'
>>>
>>>
>>> Does that sound like something sensible?
>>>
>>
>> Yes, makes sense. Please also keep in mind that this
>> utilization somehow must be mapped into power in a proper way.
>> I am currently working on addressing all of these problems
>> (including this correlation).
> 
> Right, so that mapping util to power was what I was missing and
> suggesting we do. So for 'simple' hardware we have cpufreq events for
> frequency change, and cpuidle events for idle, and with EM we can simply
> sum the relevant power numbers.
> 
> For hardware lacking EM, or hardware managed DVFS, we'll have to fudge
> things a little. How best to do that is up in the air a little, but
> virtual power curves seem a useful tool to me.
> 
> The next problem for IPA is having all the devices report power in the
> same virtual unit I suppose, but I'll leave that to others ;-)
> 

True, there are more issues. There is also another effort around powercap,
driven by Daniel Lezcano, which I am going to support. Maybe he would
also be interested in having a copy of the calculated energy stored
in EM. I must gather some requirements and align with him.

Thank you for your support!

Regards,
Lukasz