[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2d92b035-4bf7-5199-b852-540ff3edee98@arm.com>
Date: Tue, 14 Feb 2023 13:47:05 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Qais Yousef <qyousef@...alina.io>
Cc: Vincent Guittot <vincent.guittot@...aro.org>,
Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Lukasz Luba <lukasz.luba@....com>,
Wei Wang <wvw@...gle.com>, Xuewen Yan <xuewen.yan94@...il.com>,
Hank <han.lin@...iatek.com>,
Jonathan JMChen <Jonathan.JMChen@...iatek.com>
Subject: Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if
max_spare_cap is 0
On 11/02/2023 18:50, Qais Yousef wrote:
> On 02/09/23 19:02, Dietmar Eggemann wrote:
>> On 07/02/2023 10:45, Vincent Guittot wrote:
>>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@...alina.io> wrote:
[...]
>>>> Fix the logic by converting the variables into long and treating -1
>>>> value as 'not populated' instead of 0 which is a viable and correct
>>>> spare capacity value.
>>
>> The issue I see here is in case we don't have any spare capacity left,
>> the energy calculation (in terms CPU utilization) isn't correct anymore.
>>
>> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
>> you never know how big the `busy_time` for the PD really is in this moment.
>>
>> eenv_pd_busy_time()
>>
>> for_each_cpu(cpu, pd_cpus)
>> busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
>> ^^^^^^^^^
>>
>> with:
>>
>> sum_util = min(busy_time + task_busy_time, pd_cap)
>> ^^^^^^^^^
>>
>> freq = (1.25 * max_util / max) * max_freq
>>
>> energy = (perf_state(freq)->cost / max) * sum_util
>>
>>
>> energy is not related to CPU utilization anymore (since there is no idle
>> time/spare capacity) left.
>
> Am I right that what you're saying is that the energy calculation for the PD
> will be capped to a certain value and this is why you think the energy is
> incorrect?
>
> What should be the correct energy (in theory at least)?
The energy value for the perf_state is correct but the decision based
on `energy diff` in which PD to put the task might not be.
In case CPUx already has some tasks, its `pd_busy_time` contribution
is still capped by its capacity_orig.
eenv_pd_busy_time() -> cpu_util_next()
return min(util, capacity_orig_of(cpu))
In this case we can't use `energy diff` anymore to make the decision
where to put the task.
The base_energy of CPUx's PD is already so high that the
`energy diff = cur_energy - base_energy` is small enough so that CPUx
is selected as target again.
>> So EAS keeps packing on the cheaper PD/clamped OPP.
Sometimes yes, but there are occurrences in which a big CPU ends up
with the majority of the tasks. It depends on the start condition.
> Which is the desired behavior for uclamp_max?
Not sure. Essentially, EAS can't do its job properly if there is no idle
time left. As soon as uclamp_max restricts the system (like in my
example) task placement decisions via EAS (min energy diff based) can be
wrong.
> The only issue I see is that we want to distribute within a pd. Which is
> something I was going to work on and send after later - but can lump it in this
> series if it helps.
I assume you refer to
} else if ((fits > max_fits) ||
((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
here?
>> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
>> tasks all running on little PD, cpumask=0x39. The big PD is idle but
>> never beats prev_cpu in terms of energy consumption for the wakee.
>
> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
> is two folds:
>
> 1. Prevent tasks from consuming energy.
> 2. Keep them away from expensive CPUs.
Like I tried to reason about above, the energy diff based task placement
can't always assure this.
>
> 2 is actually very important for 2 reasons:
>
> a. Because of max aggregation - any uncapped tasks that wakes up will
> cause a frequency spike on this 'expensive' cpu. We don't have
> a mechanism to downmigrate it - which is another thing I'm working
> on.
True. And it can also lead to CPU overutilization which then leads to
load-balancing kicking in.
> b. It is desired to keep these bigger cpu idle ready for more important
> work.
But this is not happening all the time. Using the same workload
(6 8ms/16ms uclamp_max=440) on Juno-r0 [446 1024 1024 446 446 446]
sometimes ends up with all 6 tasks on big CPU1.
A corresponding EAS task placement looks like this one:
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[1]=0 cpu_util[f|e]=[446 994]
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[2]=0 cpu_util[f|e]=[445 1004]
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=1 busy_time=994 util=994 pd_cap=2048
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=2 busy_time=1998 util=1004 pd_cap=2048
<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=821866 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=1998 cpu_cap=1024
<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=1 energy=842434 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=2048 cpu_cap=1024
^^^^^^^^^^^^^
diff = 20,568
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[0]=0 cpu_util[f|e]=[440 446]
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[3]=0 cpu_util[f|e]=[6 6]
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[4]=0 cpu_util[f|e]=[440 446]
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[5]=0 cpu_util[f|e]=[440 446]
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=0 busy_time=446 util=446 pd_cap=1784
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=3 busy_time=452 util=2 pd_cap=1784
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=4 busy_time=898 util=446 pd_cap=1784
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=5 busy_time=907 util=1 pd_cap=1784
<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=242002 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=907 cpu_cap=446
<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=5 energy=476000 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=1784 cpu_cap=446
^^^^^^^^^^^^^
diff = 233,998
<idle>-0 [005] select_task_rq_fair: p=[task0-5 2551] target=1
> For 2, generally we don't want these tasks to steal bandwidth from these CPUs
> that we'd like to preserve for other type of work.
>
> Of course userspace has control by selecting the right uclamp_max value. They
> can increase it to allow a spill to next pd - or keep it low to steer them more
> strongly on a specific pd.
[...]
Powered by blists - more mailing lists