Date:   Fri, 20 Nov 2020 14:51:59 +0000
From:   Lukasz Luba <lukasz.luba@....com>
To:     Viresh Kumar <viresh.kumar@...aro.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Amit Daniel Kachhap <amit.kachhap@...il.com>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        Javi Merino <javi.merino@...nel.org>,
        Zhang Rui <rui.zhang@...el.com>,
        Amit Kucheria <amitk@...nel.org>, linux-kernel@...r.kernel.org,
        Quentin Perret <qperret@...gle.com>, linux-pm@...r.kernel.org
Subject: Re: [PATCH V3 2/2] thermal: cpufreq_cooling: Reuse sched_cpu_util()
 for SMP platforms

Hi Viresh,

On 11/19/20 7:38 AM, Viresh Kumar wrote:
> Several parts of the kernel are already using the effective CPU
> utilization (as seen by the scheduler) to get the current load on the
> CPU; do the same here instead of depending on the idle time of the CPU,
> which isn't that accurate comparatively.
> 
> This is also the right thing to do as it makes the cpufreq governor
> (schedutil) align better with the cpufreq_cooling driver, as the power
> requested by cpufreq_cooling governor will exactly match the next
> frequency requested by the schedutil governor since they are both using
> the same metric to calculate load.
> 
> Note that, this (and CPU frequency scaling in general) doesn't work that
> well with idle injection as that is done from rt threads and is counted
> as load while it tries to do quite the opposite. That should be solved
> separately though.

I think cpuidle cooling is not used with IPA, but Daniel might correct
me here.
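
(Side note for anyone following along: a rough userspace model of the two
load estimates being compared here. This is illustrative code only, not the
kernel code from the patch; window_load()/util_load() are made-up names and
1024 is just the scheduler's capacity scale.)

/*
 * Old approach: the load is the busy ratio of the last polling window,
 * derived from idle time. New approach (modelled): read the scheduler's
 * utilization signal, which already carries history, and scale it.
 */
#include <stdio.h>

static unsigned int window_load(unsigned long long delta_time_us,
                                unsigned long long delta_idle_us)
{
        if (delta_time_us <= delta_idle_us)
                return 0;
        return 100 * (delta_time_us - delta_idle_us) / delta_time_us;
}

static unsigned int util_load(unsigned long util, unsigned long capacity)
{
        return 100 * util / capacity;
}

int main(void)
{
        /* 100 ms window of which 90 ms was idle -> old load says 10%. */
        printf("window load: %u%%\n", window_load(100000, 90000));
        /* Utilization 614 out of 1024 -> ~60%, independent of the window. */
        printf("util load:   %u%%\n", util_load(614, 1024));
        return 0;
}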

> 
> This was tested on ARM Hikey6220 platform with hackbench, sysbench and
> schbench. None of them showed any regression or significant
> improvements. Schbench is the most important one out of these as it
> creates the scenario where the utilization numbers provide a better
> estimate of the future.
> 
> Scenario 1: The CPUs were mostly idle in the previous polling window of
> the IPA governor as the tasks were sleeping and here are the details
> from traces (load is in %):
> 
>   Old: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=203 load={{0x35,0x1,0x0,0x31,0x0,0x0,0x64,0x0}} dynamic_power=1339
>   New: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=600 load={{0x60,0x46,0x45,0x45,0x48,0x3b,0x61,0x44}} dynamic_power=3960
> 
> Here, the "Old" line gives the load and requested_power (dynamic_power
> here) numbers calculated using the idle time based implementation, while
> "New" is based on the CPU utilization from scheduler.
> 
> As can be clearly seen, the load and requested_power numbers are simply
> incorrect in the idle time based approach and the numbers collected from
> CPU's utilization are much closer to the reality.

This contradicts the 'Scenario 1' description, doesn't it?
Frequency at 1.2GHz, 75% total_load, ~4W of power... I'd say that if the
CPUs were mostly idle, then ~1.3W would better reflect that state.
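
Just to show where I get ~1.3W vs ~4W from: at a fixed frequency the
requested dynamic power scales linearly with total_load, and your Scenario 2
"Old" line below reports ~5280mW at total_load=800 and 1.2GHz. Quick check
(illustrative arithmetic only, not the cpufreq_cooling code):

#include <stdio.h>

/* Full-load power at the OPP, scaled linearly by the summed load. */
static unsigned int scaled_power_mw(unsigned int full_load_mw,
                                    unsigned int total_load,
                                    unsigned int max_load)
{
        return full_load_mw * total_load / max_load;
}

int main(void)
{
        printf("load 203 -> %u mW\n", scaled_power_mw(5280, 203, 800)); /* ~1339 */
        printf("load 600 -> %u mW\n", scaled_power_mw(5280, 600, 800)); /* 3960 */
        printf("load 708 -> %u mW\n", scaled_power_mw(5280, 708, 800)); /* ~4672 */
        return 0;
}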

What was the IPA period in your setup?

The result depends on your platform's IPA period (e.g. 100ms) and on the
state of the runqueues at that sampling point in time. The PELT decay/rise
period is different. I am not sure that looking at these signals gives you
the system's average load over the last e.g. 100ms. Maybe the IPA period is
too short/long and couldn't catch up with the PELT signals?
But we don't want too short an averaging period either, since 16ms is a
display tick.

IMHO, based on this result, it looks like the util signal might have lost
the older information from the past, or hasn't converged down to this low
load yet.
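
To put rough numbers on the PELT timescale (assuming the default 32ms
half-life; illustrative arithmetic only, not kernel code):

#include <math.h>
#include <stdio.h>

/*
 * PELT decays geometrically, losing half of its weight every ~32 ms.
 * This models how much of an old, fully-busy signal survives after the
 * CPU has then been idle for idle_ms.
 */
static double pelt_decay(double util_pct, double idle_ms)
{
        return util_pct * pow(0.5, idle_ms / 32.0);
}

int main(void)
{
        /* 100% busy, then idle for a whole 100 ms IPA window -> ~11% left. */
        printf("after 100 ms idle: %.0f%%\n", pelt_decay(100.0, 100.0));
        /* Idle for only half of the window -> ~34% left. */
        printf("after  50 ms idle: %.0f%%\n", pelt_decay(100.0, 50.0));
        return 0;
}

So if the CPUs really were idle for most of a ~100ms window, I'd expect the
util-based load to have decayed well below the ~60-75% shown in the "New"
line, unless the tasks woke up again shortly before the sample.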

> 
> Scenario 2: The CPUs were busy in the previous polling window of the IPA
> governor:
> 
>   Old: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=800 load={{0x64,0x64,0x64,0x64,0x64,0x64,0x64,0x64}} dynamic_power=5280
>   New: thermal_power_cpu_get_power: cpus=00000000,000000ff freq=1200000 total_load=708 load={{0x4d,0x5c,0x5c,0x5b,0x5c,0x5c,0x51,0x5b}} dynamic_power=4672
> 
> As can be seen, the idle time based load is 100% for all the CPUs as it
> took only the last window into account, but in reality the CPUs aren't
> that loaded as shown by the utilization numbers.

This is also odd. The ~88% total_load looks like the signal had started
decaying, hadn't converged to 100% yet, or some task vanished?
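
The "didn't converge yet" explanation fits the PELT rise side numerically:
with the same 32ms half-life as in the sketch above, a task running
continuously from idle reaches roughly 1 - 0.5^(t/32) of full utilization
after t ms, i.e. ~89% after 100ms. A minimal sketch (illustrative only):

#include <math.h>
#include <stdio.h>

/* Fraction of full utilization reached after busy_ms of continuous
 * running, starting from zero, with a 32 ms half-life. */
static double pelt_rise(double busy_ms)
{
        return 100.0 * (1.0 - pow(0.5, busy_ms / 32.0));
}

int main(void)
{
        printf("after 100 ms busy: %.0f%%\n", pelt_rise(100.0)); /* ~89% */
        printf("after 200 ms busy: %.0f%%\n", pelt_rise(200.0)); /* ~99% */
        return 0;
}

That would roughly match the ~88% total_load seen in the "New" line here.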

Regards,
Lukasz
