Message-ID: <4439aa79-64d1-48f4-ba5f-fc794fc274d3@arm.com>
Date: Wed, 19 Jun 2024 13:20:37 +0100
From: Lukasz Luba <lukasz.luba@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Kajetan Puchalski <kajetan.puchalski@....com>, rafael@...nel.org,
daniel.lezcano@...aro.org, Dietmar.Eggemann@....com, dsmythies@...us.net,
yu.chen.surf@...il.com, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
Ulf Hansson <ulf.hansson@...aro.org>, Qais Yousef <qyousef@...alina.io>
Subject: Re: [PATCH v6 2/2] cpuidle: teo: Introduce util-awareness
Hi Vincent,
On 6/12/24 10:17, Lukasz Luba wrote:
>
>
> On 6/12/24 10:04, Vincent Guittot wrote:
>> On Wed, 12 Jun 2024 at 09:25, Lukasz Luba <lukasz.luba@....com> wrote:
>>>
>>> Hi Vincent,
>>>
>>> My apologies for delay, I was on sick leave.
>>>
>>> On 5/28/24 15:07, Vincent Guittot wrote:
>>>> On Tue, 28 May 2024 at 11:59, Lukasz Luba <lukasz.luba@....com> wrote:
>>>>>
>>>>> Hi Vincent,
>>>>>
>>>>> On 5/28/24 10:29, Vincent Guittot wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> I'm quite late on this thread but this patchset creates a major
>>>>>> regression for psci cpuidle driver when using the OSI mode (OS
>>>>>> initiated mode). In such a case, cpuidle driver takes care only of
>>>>>> CPUs power state and the deeper C-states, which include cluster and
>>>>>> other power domains, are handled with power domain framework. In such
>>>>>> configuration, cpuidle has only 2 c-states: WFI and cpu off states,
>>>>>> and other states that include the clusters are managed by genpd and
>>>>>> its governor.
>>>>>>
>>>>>> This patch selects cpuidle c-state N-1 as soon as the utilization is
>>>>>> above CPU capacity / 64 which means at most a level of 16 on the big
>>>>>> core but can be as low as 4 on little cores. These levels are very low
>>>>>> and the main result is that as soon as there is very little activity
>>>>>> on a CPU, cpuidle always selects WFI states whatever the estimated
>>>>>> sleep duration, which prevents any deeper states. Another effect is
>>>>>> that it also keeps the tick firing every 1ms in my case.
>>>>>
>>>>> Thanks for reporting this.
>>>>> Could you add what regression it's causing, please?
>>>>> Performance or higher power?
>>>>
>>>> It's not a perf but rather a power regression. I don't have a power
>>>> counter so it's difficult to give figures but I found it while running
>>>> a unitary test below on my rb5:
>>>> run 500us every 19457ms on medium core (uclamp_min: 600).
>>>
>>> Mid cores are built differently, they have low static power (leakage).
>>> Therefore, for them the residency in deeper idle state should be
>>> longer than for Big CPU. When you power off the CPU you lose your
>>> cache data/code. The data needs to be stored in the L3 or
>>> further memory. When the cpu is powered on again, it needs code & data.
>>> Thus, it will transfer that data/code from L3 or from DDR. That
>>> information transfer has energy cost (it's not for free). The cost
>>> of data from DDR is very high.
>>> Then we have to justify if the energy lost while sleeping in shallower
>>> idle state can be higher than loading data/code from outside.
>>> For different CPU it would be different.
>>
>> I'm aware of these points and the residency time of an idle state is
>> set to reflect this cost. In my case, the idle time is far above the
>> residency time which means that we should get some energy saving.
>> cpu off 4.488ms
>> cluster off 9.987ms
>> vs
>> sleep duration 18.000ms
>>
>> Also, the policy of selecting a shallower idle state than the final
>> selected one doesn't work with PSCI OSI because cpuidle is only aware
>> of per CPU idle states but it is not aware of the cluster or
>> deeper/wider idle states so cpuidle doesn't know what will be the
>> final selected idle state. This is a major problem, in addition to
>> keeping the tick firing.
>
> I think we are aligned with this.
> Something has to change in this implementation of idle gov.
>
> I'm a bit more skeptical about your second point with PSCI.
> That standard might be too strong to change.
>
I'm coming back to you with some public information about our WFI
idle state. WFI is not necessarily just clock gating: the HW can
automatically put the CPU into retention mode, which saves
static power.
That's why I said WFI can be really efficient and that we can/should
leverage it. That's also why we shouldn't assume power numbers
based on idle state statistics (especially those available in the kernel).
Please check the TRM for Cortex-X1, section
A4.6.4 Core dynamic retention mode [1].
The period after which the HW can decide to enter retention mode
is configurable via registers. It's up to our vendors to experiment
and implement the right configuration, and AFAIK some of them do.
We should really base this discussion on data, including power
measurements from experiments, and also investigate more deeply
whether the right configurations are used in HW.
Regards,
Lukasz
[1]
https://developer.arm.com/documentation/101433/0102/Functional-description/Power-management-/Core-power-modes/Core-dynamic-retention-mode
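
For readers following the threshold arithmetic quoted earlier in the thread (utilization above CPU capacity / 64, giving 16 on a big core and as low as 4 on a little core), here is a minimal sketch of that check. It is not taken from the patch itself: the helper names are hypothetical, and the capacities 1024 and 256 are assumptions inferred from the 16 and 4 figures above.

```c
#include <stdbool.h>

/* Assumption: "capacity / 64" is expressed as a right shift by 6. */
#define UTIL_THRESHOLD_SHIFT 6

/* Hypothetical helper: utilization threshold for a CPU of a given capacity. */
static unsigned long util_threshold(unsigned long max_capacity)
{
	return max_capacity >> UTIL_THRESHOLD_SHIFT;
}

/*
 * Hypothetical helper: true when the CPU would be treated as "utilized",
 * i.e. when its utilization is strictly above the threshold.
 */
static bool cpu_is_utilized(unsigned long util, unsigned long max_capacity)
{
	return util > util_threshold(max_capacity);
}
```

With these assumptions, util_threshold(1024) is 16 and util_threshold(256) is 4, so even a small amount of activity pushes a little core above the threshold.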