Message-ID: <8f4156a7-46ca-361d-bcb7-1cbdc860ef37@arm.com>
Date: Thu, 10 Jun 2021 10:36:42 +0100
From: Lukasz Luba <lukasz.luba@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: linux-kernel <linux-kernel@...r.kernel.org>,
"open list:THERMAL" <linux-pm@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
"Rafael J. Wysocki" <rjw@...ysocki.net>,
Viresh Kumar <viresh.kumar@...aro.org>,
Quentin Perret <qperret@...gle.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Vincent Donnefort <vincent.donnefort@....com>,
Beata Michalska <Beata.Michalska@....com>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, segall@...gle.com,
Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>
Subject: Re: [PATCH v2 1/2] sched/fair: Take thermal pressure into account
while estimating energy

On 6/10/21 10:11 AM, Vincent Guittot wrote:
> On Thu, 10 Jun 2021 at 10:42, Lukasz Luba <lukasz.luba@....com> wrote:
>>
>>
>>
>> On 6/10/21 8:59 AM, Vincent Guittot wrote:
>>> On Fri, 4 Jun 2021 at 10:10, Lukasz Luba <lukasz.luba@....com> wrote:
>>>>
>>>> Energy Aware Scheduling (EAS) needs to be able to predict the frequency
>>>> requests made by the SchedUtil governor to properly estimate energy used
>>>> in the future. It has to take into account CPUs utilization and forecast
>>>> Performance Domain (PD) frequency. There is a corner case when the max
>>>> allowed frequency might be reduced due to thermal. SchedUtil is aware of
>>>> that reduced frequency, so it should be taken into account also in EAS
>>>> estimations.
>>>>
>>>> SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
>>>> a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
>>>> to 'policy::max'. SchedUtil is responsible to respect that upper limit
>>>> while setting the frequency through CPUFreq drivers. This effective
>>>> frequency is stored internally in 'sugov_policy::next_freq' and EAS has
>>>> to predict that value.
>>>>
>>>> In the existing code the raw value of arch_scale_cpu_capacity() is used
>>>> for clamping the returned CPU utilization from effective_cpu_util().
>>>> This patch fixes issue with too big single CPU utilization, by introducing
>>>> clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU
>>>> capacity reduced by thermal pressure signal. We rely on this load avg
>>>> geometric series in similar way as other mechanisms in the scheduler.
>>>>
>>>> Thanks to knowledge about allowed CPU capacity, we don't get too big value
>>>> for a single CPU utilization, which is then added to the util sum. The
>>>> util sum is used as a source of information for estimating whole PD energy.
>>>> To avoid wrong energy estimation in EAS (due to capped frequency), make
>>>> sure that the calculation of util sum is aware of allowed CPU capacity.
>>>>
>>>> Signed-off-by: Lukasz Luba <lukasz.luba@....com>
>>>> ---
>>>> kernel/sched/fair.c | 17 ++++++++++++++---
>>>> 1 file changed, 14 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 161b92aa1c79..1aeddecabc20 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6527,6 +6527,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>>>> struct cpumask *pd_mask = perf_domain_span(pd);
>>>> unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
>>>> unsigned long max_util = 0, sum_util = 0;
>>>> + unsigned long _cpu_cap = cpu_cap;
>>>> int cpu;
>>>>
>>>> /*
>>>> @@ -6558,14 +6559,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>>>> cpu_util_next(cpu, p, -1) + task_util_est(p);
>>>> }
>>>>
>>>> + /*
>>>> + * Take the thermal pressure from non-idle CPUs. They have
>>>> + * most up-to-date information. For idle CPUs thermal pressure
>>>> + * signal is not updated so often.
>>>
>>> What do you mean by "not updated so often" ? Do you have a value ?
>>>
>>> Thermal pressure is updated at the same rate as other PELT values of
>>> an idle CPU. Why is it a problem there ?
>>>
>>
>>
>> For idle CPU the value is updated 'remotely' by some other CPU
>> running nohz_idle_balance(). That goes into
>> update_blocked_averages() if the flags and checks are OK inside
>> update_nohz_stats(). Sometimes this is not called
>> because other_have_blocked() returned false. It can happen for a long
>
> So i miss that you were in a loop and the below was called for each
> cpu and _cpu_cap was overwritten
>
> + if (!idle_cpu(cpu))
> + _cpu_cap = cpu_cap - thermal_load_avg(cpu_rq(cpu));
>
> But that also means that if the 1st cpus of the pd are idle, they will
> use original capacity whereas the other ones will remove the thermal
> pressure. Isn't this a problem ? You don't use the same capacity for
> all cpus in the performance domain regarding the thermal pressure?

True, but in my experiments I haven't observed idle CPUs still
carrying a big util (bigger than _cpu_cap). It has already decayed,
so it's not a problem for idle CPUs. Although, it might be that my
test case simply didn't trigger it.

Would it be worth adding a loop above this one, to be 100% sure, and
getting the thermal pressure signal from some running CPU? Then always
apply that same value inside the 2nd loop?
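
To make the two-loop idea concrete, here is a minimal userspace sketch,
not kernel code. Everything in it is an assumption made for illustration:
'struct cpu_sample' and the helpers pd_thermal_pressure() and
pd_sum_util() are hypothetical names, the 'idle', 'util' and 'thermal'
fields only stand in for idle_cpu(), the per-CPU utilization and
thermal_load_avg(cpu_rq(cpu)), and falling back to zero pressure when
every CPU in the PD is idle is a choice made here, not something decided
in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU snapshot; this is not a kernel structure. */
struct cpu_sample {
	bool idle;             /* stands in for idle_cpu(cpu) */
	unsigned long util;    /* stands in for the estimated CPU utilization */
	unsigned long thermal; /* stands in for thermal_load_avg(cpu_rq(cpu)) */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * 1st loop: take the thermal pressure from the first running CPU, since
 * its signal is the most recently updated one.
 * Assumption: return 0 (no pressure) when all CPUs are idle.
 */
unsigned long pd_thermal_pressure(const struct cpu_sample *cpus, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (!cpus[i].idle)
			return cpus[i].thermal;
	}
	return 0;
}

/*
 * 2nd loop: clamp every CPU, idle or not, to the same allowed capacity
 * and accumulate the result into the PD util sum.
 */
unsigned long pd_sum_util(const struct cpu_sample *cpus, size_t nr,
			  unsigned long cpu_cap)
{
	unsigned long pressure = pd_thermal_pressure(cpus, nr);
	/* Guard against underflow if pressure exceeds the raw capacity. */
	unsigned long allowed = pressure < cpu_cap ? cpu_cap - pressure : 0;
	unsigned long sum_util = 0;

	for (size_t i = 0; i < nr; i++)
		sum_util += min_ul(cpus[i].util, allowed);

	return sum_util;
}
```

With a raw capacity of 1024 and a pressure of 200 read from a running
CPU, an idle first CPU with util 900 is clamped to 824 like the others.
With the single-loop version it would still be compared against 1024,
which is the inconsistency pointed out above.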