linux-kernel - Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <3fa10040-7e48-4100-9d70-cbbac406abde@arm.com>
Date: Tue, 19 Aug 2025 10:10:30 +0100
From: Christian Loehle <christian.loehle@....com>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: Linux PM <linux-pm@...r.kernel.org>,
 Frederic Weisbecker <frederic@...nel.org>,
 LKML <linux-kernel@...r.kernel.org>, Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v1 3/3] cpuidle: governors: menu: Special-case nohz_full
 CPUs

On 8/18/25 18:41, Rafael J. Wysocki wrote:
> On Thu, Aug 14, 2025 at 4:09 PM Christian Loehle
> <christian.loehle@....com> wrote:
>>
>> On 8/13/25 11:29, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>>>
>>> When the menu governor runs on a nohz_full CPU and there are no user
>>> space timers in the workload on that CPU, it ends up selecting idle
>>> states with target residency values above TICK_NSEC all the time due to
>>> a tick_nohz_tick_stopped() check designed for a different use case.
>>> Namely, on nohz_full CPUs the fact that the tick has been stopped does
>>> not actually mean anything in particular, whereas in the other case it
>>> indicates that previously the CPU was expected to be idle sufficiently
>>> long for the tick to be stopped, so it is not unreasonable to expect
>>> it to be idle beyond the tick period length again.
>>>
>>> In some cases, this behavior causes latency in the workload to grow
>>> undesirably.  It may also cause the workload to consume more energy
>>> than necessary if the CPU does not spend enough time in the selected
>>> deep idle states.
>>>
>>> Address this by amending the tick_nohz_tick_stopped() check in question
>>> with a tick_nohz_full_cpu() one to avoid using the time till the next
>>> timer event as the predicted_ns value all the time on nohz_full CPUs.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>>> ---
>>>  drivers/cpuidle/governors/menu.c |   12 +++++++++++-
>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> --- a/drivers/cpuidle/governors/menu.c
>>> +++ b/drivers/cpuidle/governors/menu.c
>>> @@ -293,8 +293,18 @@
>>>        * in a shallow idle state for a long time as a result of it.  In that
>>>        * case, say we might mispredict and use the known time till the closest
>>>        * timer event for the idle state selection.
>>> +      *
>>> +      * However, on nohz_full CPUs the tick does not run as a rule and the
>>> +      * time till the closest timer event may always be effectively infinite,
>>> +      * so using it as a replacement for the predicted idle duration would
>>> +      * effectively always cause the prediction results to be discarded and
>>> +      * deep idle states to be selected all the time.  That might introduce
>>> +      * unwanted latency into the workload and cause more energy than
>>> +      * necessary to be consumed if the discarded prediction results are
>>> +      * actually accurate, so skip nohz_full CPUs here.
>>>        */
>>> -     if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
>>> +     if (tick_nohz_tick_stopped() && !tick_nohz_full_cpu(dev->cpu) &&
>>> +         predicted_ns < TICK_NSEC)
>>>               predicted_ns = data->next_timer_ns;
>>>
>>>       /*
>>>
>>>
>>>
>>
>> OTOH the behaviour with $SUBJECT possibly means that we use predicted_ns from
>> get_typical_interval() (which may suggest picking a shallow state based on
>> previous wakeup patterns) only then to never wake up again?
> 
> Yes, there is this risk, but the current behavior is more damaging IMV
> because it (potentially) hurts both energy efficiency and performance.
> 
> It is also arguably easier for the user to remedy getting stuck in a
> shallow idle state than to change governor's behavior (PM QoS is a bit
> too blunt for this).
> 
> Moreover, configuring CPUs as nohz_full and leaving them in long idle
> may not be the most efficient use of them.

True, on the other hand the setup cost for nohz_full is so high, you'd expect
the additional idle states disabling depending on the workload isn't too much
to ask for...
Anyway feel free to go ahead.