Message-ID: <48db3f08-a066-c078-bfc9-bf20f66e067a@arm.com>
Date:   Mon, 22 May 2023 09:30:14 +0100
From:   Lukasz Luba <lukasz.luba@....com>
To:     Qais Yousef <qyousef@...alina.io>
Cc:     Vincent Guittot <vincent.guittot@...aro.org>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, Wei Wang <wvw@...gle.com>,
        Xuewen Yan <xuewen.yan94@...il.com>,
        Hank <han.lin@...iatek.com>,
        Jonathan JMChen <Jonathan.JMChen@...iatek.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>
Subject: Re: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if
 max_spare_cap is 0

Hi Qais,

I have a question regarding the 'soft cpu affinity'.

On 2/11/23 17:50, Qais Yousef wrote:
> On 02/09/23 19:02, Dietmar Eggemann wrote:
>> On 07/02/2023 10:45, Vincent Guittot wrote:
>>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@...alina.io> wrote:
>>>>
>>>> When uclamp_max is being used, the util of the task could be higher than
>>>> the spare capacity of the CPU, but due to uclamp_max value we force fit
>>>> it there.
>>>>
>>>> The way the condition for checking for max_spare_cap in
>>>> find_energy_efficient_cpu() was constructed; it ignored any CPU that has
>>>> its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
>>>> max_spare_cap to 0; this led to never setting max_spare_cap_cpu and
>>>> hence ending up never performing compute_energy() for this cluster and
>>>> missing an opportunity for a better energy efficient placement to honour
>>>> uclamp_max setting.
>>>>
>>>>          max_spare_cap = 0;
>>>>          cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high
>>>>
>>>>          ...
>>>>
>>>>          util_fits_cpu(...);             // will return true if uclamp_max forces it to fit
>>
>> s/true/1/ ?
>>
>>>>
>>>>          ...
>>>>
>>>>          // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
>>>>          if (cpu_cap > max_spare_cap) {
>>>>                  max_spare_cap = cpu_cap;
>>>>                  max_spare_cap_cpu = cpu;
>>>>          }
>>>>
>>>> prev_spare_cap suffers from a similar problem.
>>>>
>>>> Fix the logic by converting the variables into long and treating -1
>>>> value as 'not populated' instead of 0 which is a viable and correct
>>>> spare capacity value.
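
To make the condition above concrete, here is a minimal standalone sketch
(illustration only, not the kernel code). The helper names
pick_max_spare_old() and pick_max_spare_new() and the spare_cap[] array are
made up for this example; only the comparison and the 0 vs -1 initial value
are taken from the commit message.

```c
#include <stddef.h>

/*
 * Illustration only: both helpers are hypothetical and take an array of
 * per-CPU spare capacities instead of walking a real perf domain.
 */

/* Old logic: 0 doubles as "not populated", so a CPU with 0 spare never wins. */
static int pick_max_spare_old(const unsigned long *spare_cap, size_t nr_cpus)
{
	unsigned long max_spare_cap = 0;
	int max_spare_cap_cpu = -1;

	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		if (spare_cap[cpu] > max_spare_cap) {
			max_spare_cap = spare_cap[cpu];
			max_spare_cap_cpu = (int)cpu;
		}
	}
	return max_spare_cap_cpu;
}

/* Fixed logic: -1 means "not populated", so 0 is a valid spare capacity. */
static int pick_max_spare_new(const unsigned long *spare_cap, size_t nr_cpus)
{
	long max_spare_cap = -1;
	int max_spare_cap_cpu = -1;

	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		if ((long)spare_cap[cpu] > max_spare_cap) {
			max_spare_cap = (long)spare_cap[cpu];
			max_spare_cap_cpu = (int)cpu;
		}
	}
	return max_spare_cap_cpu;
}
```

With every spare capacity at 0, e.g. spare_cap[] = {0, 0}, the old helper
returns -1 (no candidate CPU) while the new one returns CPU 0.
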
>>
>> The issue I see here is that in case we don't have any spare capacity left,
>> the energy calculation (in terms of CPU utilization) isn't correct anymore.
>>
>> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
>> you never know how big the `busy_time` for the PD really is in this moment.
>>
>> eenv_pd_busy_time()
>>
>>    for_each_cpu(cpu, pd_cpus)
>>      busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
>>      ^^^^^^^^^
>>
>> with:
>>
>>    sum_util = min(busy_time + task_busy_time, pd_cap)
>>                   ^^^^^^^^^
>>
>>    freq = (1.25 * max_util / max) * max_freq
>>
>>    energy = (perf_state(freq)->cost / max) * sum_util
>>
>>
>> energy is not related to CPU utilization anymore (since there is no idle
>> time/spare capacity left).
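
As I read the formulas above, the saturation can be sketched like this. It
is a simplified standalone model, not the kernel implementation: the helper
names, the flat 'cost' argument and the fixed OPP are assumptions made for
the example, and the frequency selection step is left out.

```c
/* Hypothetical helper: sum_util = min(busy_time + task_busy_time, pd_cap). */
static unsigned long pd_sum_util(unsigned long busy_time,
				 unsigned long task_busy_time,
				 unsigned long pd_cap)
{
	unsigned long sum = busy_time + task_busy_time;

	return sum < pd_cap ? sum : pd_cap;
}

/* Hypothetical helper: energy = (cost / max) * sum_util, at a fixed OPP. */
static unsigned long pd_energy(unsigned long cost, unsigned long cpu_cap,
			       unsigned long sum_util)
{
	return cost * sum_util / cpu_cap;
}

/* Estimated extra energy of placing the task in this PD. */
static unsigned long pd_energy_delta(unsigned long busy_time,
				     unsigned long task_busy_time,
				     unsigned long pd_cap,
				     unsigned long cost,
				     unsigned long cpu_cap)
{
	unsigned long base = pd_energy(cost, cpu_cap,
				       pd_sum_util(busy_time, 0, pd_cap));
	unsigned long with_task = pd_energy(cost, cpu_cap,
					    pd_sum_util(busy_time,
							task_busy_time,
							pd_cap));

	return with_task - base;
}
```

For the Juno little PD (4 CPUs at 446, so pd_cap = 1784) that is already
saturated, pd_energy_delta(1784, 440, 1784, cost, 446) is 0 for any cost,
where 440 is a hypothetical busy time for the wakee. The task looks free on
that PD, while the same task on an idle PD gets a non-zero estimate.
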
> 
> Am I right that what you're saying is that the energy calculation for the PD
> will be capped to a certain value and this is why you think the energy is
> incorrect?
> 
> What should be the correct energy (in theory at least)?
> 
>>
>> So EAS keeps packing on the cheaper PD/clamped OPP.
> 
> Which is the desired behavior for uclamp_max?
> 
> The only issue I see is that we want to distribute within a pd, which is
> something I was going to work on and send later - but I can lump it into this
> series if it helps.
> 
>>
>> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
>> tasks all running on little PD, cpumask=0x39. The big PD is idle but
>> never beats prev_cpu in terms of energy consumption for the wakee.
> 
> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
> is twofold:
> 
> 	1. Prevent tasks from consuming energy.
> 	2. Keep them away from expensive CPUs.
> 
> 2 is actually very important for 2 reasons:
> 
> 	a. Because of max aggregation - any uncapped task that wakes up will
> 	   cause a frequency spike on this 'expensive' cpu. We don't have
> 	   a mechanism to downmigrate it - which is another thing I'm working
> 	   on.
> 	b. It is desired to keep these bigger cpus idle, ready for more
> 	   important work.
> 
> For 2, generally we don't want these tasks to steal bandwidth from these CPUs
> that we'd like to preserve for other types of work.

I'm a bit wary of such a 'strong force'. It would mean the task does not
go through EAS if we set uclamp_max to e.g. 90 while the little capacity
is 125. Or am I missing something?

This might effectively use more energy for tasks which can run on any
CPU and for which EAS would otherwise find a good energy placement. I'm
worried about this, since we have L3+littles in one DVFS domain and the
L3 will only get bigger in the future.

IMO, to keep the big CPUs idle more often, we should give them a big
energy wake-up cost. That's the 3rd EM feature I presented at OSPM2023.

> 
> Of course userspace has control by selecting the right uclamp_max value. They
> can increase it to allow a spill to next pd - or keep it low to steer them more
> strongly on a specific pd.

This would be interesting to see in practice. I think we need such an
experiment for changes like this.

Regards,
Lukasz