[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3fa10040-7e48-4100-9d70-cbbac406abde@arm.com>
Date: Tue, 19 Aug 2025 10:10:30 +0100
From: Christian Loehle <christian.loehle@....com>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: Linux PM <linux-pm@...r.kernel.org>,
Frederic Weisbecker <frederic@...nel.org>,
LKML <linux-kernel@...r.kernel.org>, Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full
CPUs
On 8/18/25 18:41, Rafael J. Wysocki wrote:
> On Thu, Aug 14, 2025 at 4:09 PM Christian Loehle
> <christian.loehle@....com> wrote:
>>
>> On 8/13/25 11:29, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>>>
>>> When the menu governor runs on a nohz_full CPU and there are no user
>>> space timers in the workload on that CPU, it ends up selecting idle
>>> states with target residency values above TICK_NSEC all the time due to
>>> a tick_nohz_tick_stopped() check designed for a different use case.
>>> Namely, on nohz_full CPUs the fact that the tick has been stopped does
>>> not actually mean anything in particular, whereas in the other case it
>>> indicates that previously the CPU was expected to be idle sufficiently
>>> long for the tick to be stopped, so it is not unreasonable to expect
>>> it to be idle beyond the tick period length again.
>>>
>>> In some cases, this behavior causes latency in the workload to grow
>>> undesirably. It may also cause the workload to consume more energy
>>> than necessary if the CPU does not spend enough time in the selected
>>> deep idle states.
>>>
>>> Address this by amending the tick_nohz_tick_stopped() check in question
>>> with a tick_nohz_full_cpu() one to avoid using the time till the next
>>> timer event as the predicted_ns value all the time on nohz_full CPUs.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>>> ---
>>> drivers/cpuidle/governors/menu.c | 12 +++++++++++-
>>> 1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> --- a/drivers/cpuidle/governors/menu.c
>>> +++ b/drivers/cpuidle/governors/menu.c
>>> @@ -293,8 +293,18 @@
>>> * in a shallow idle state for a long time as a result of it. In that
>>> * case, say we might mispredict and use the known time till the closest
>>> * timer event for the idle state selection.
>>> + *
>>> + * However, on nohz_full CPUs the tick does not run as a rule and the
>>> + * time till the closest timer event may always be effectively infinite,
>>> + * so using it as a replacement for the predicted idle duration would
>>> + * effectively always cause the prediction results to be discarded and
>>> + * deep idle states to be selected all the time. That might introduce
>>> + * unwanted latency into the workload and cause more energy than
>>> + * necessary to be consumed if the discarded prediction results are
>>> + * actually accurate, so skip nohz_full CPUs here.
>>> */
>>> - if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
>>> + if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
>>> + predicted_ns < TICK_NSEC)
>>> predicted_ns = data->next_timer_ns;
>>>
>>> /*
>>>
>>>
>>>
>>
>> OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
>> get_typical_interval() (which may suggest picking a shallow state based on
>> previous wakeup patterns) only then to never wake up again?
>
> Yes, there is this risk, but the current behavior is more damaging IMV
> because it (potentially) hurts both energy efficiency and performance.
>
> It is also arguably easier for the user to remedy getting stuck in a
> shallow idle state than to change governor's behavior (PM QoS is a bit
> too blunt for this).
>
> Moreover, configuring CPUs as nohz_full and leaving them in long idle
> may not be the most efficient use of them.
True, on the other hand the setup cost for nohz_full is so high, you'd expect
the additional idle states disabling depending on the workload isn't too much
to ask for...
Anyway feel free to go ahead.
Powered by blists - more mailing lists