Message-ID: <7ffcb716-9a1b-48c2-aaa4-469d0df7c792@arm.com>
Date: Tue, 5 Aug 2025 15:41:27 +0100
From: Christian Loehle <christian.loehle@....com>
To: "Rafael J. Wysocki" <rafael@...nel.org>, Marc Zyngier <maz@...nel.org>
Cc: Linux PM <linux-pm@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
Daniel Lezcano <daniel.lezcano@...aro.org>,
Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>,
Aboorva Devarajan <aboorvad@...ux.ibm.com>,
Thomas Gleixner <tglx@...utronix.de>, Mark Rutland <mark.rutland@....com>
Subject: Re: [RFT][PATCH v1 5/5] cpuidle: menu: Avoid discarding useful
information
On 8/5/25 14:23, Rafael J. Wysocki wrote:
> On Mon, Aug 4, 2025 at 6:54 PM Marc Zyngier <maz@...nel.org> wrote:
>>
>> [+ Thomas, Mark]
>>
>> On Thu, 06 Feb 2025 14:29:05 +0000,
>> "Rafael J. Wysocki" <rjw@...ysocki.net> wrote:
>>>
>>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>>>
>>> When giving up on making a high-confidence prediction,
>>> get_typical_interval() always returns UINT_MAX which means that the
>>> next idle interval prediction will be based entirely on the time till
>>> the next timer. However, the information represented by the most
>>> recent intervals may not be completely useless in those cases.
>>>
>>> Namely, the largest recent idle interval is an upper bound on the
>>> recently observed idle duration, so it is reasonable to assume that
>>> the next idle duration is unlikely to exceed it. Moreover, this is
>>> still true after eliminating the suspected outliers if the sample
>>> set still under consideration is at least as large as 50% of the
>>> maximum sample set size.
>>>
>>> Accordingly, make get_typical_interval() return the current maximum
>>> recent interval value in that case instead of UINT_MAX.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>>> ---
>>> drivers/cpuidle/governors/menu.c | 13 ++++++++++++-
>>> 1 file changed, 12 insertions(+), 1 deletion(-)
>>>
>>> --- a/drivers/cpuidle/governors/menu.c
>>> +++ b/drivers/cpuidle/governors/menu.c
>>> @@ -190,8 +190,19 @@
>>> * This can deal with workloads that have long pauses interspersed
>>> * with sporadic activity with a bunch of short pauses.
>>> */
>>> - if ((divisor * 4) <= INTERVALS * 3)
>>> + if (divisor * 4 <= INTERVALS * 3) {
>>> + /*
>>> + * If there are sufficiently many data points still under
>>> + * consideration after the outliers have been eliminated,
>>> + * returning without a prediction would be a mistake because it
>>> + * is likely that the next interval will not exceed the current
>>> + * maximum, so return the latter in that case.
>>> + */
>>> + if (divisor >= INTERVALS / 2)
>>> + return max;
>>> +
>>> return UINT_MAX;
>>> + }
>>>
>>> /* Update the thresholds for the next round. */
>>> if (avg - min > max - avg)
>>
>> It appears that this patch, which made it in 6.15, results in *a lot*
>> of extra interrupts on one of my arm64 test machines.
>>
>> * Without this patch:
>>
>> maz@...-leg-emma:~$ vmstat -y 1
>> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> r b swpd free buff cache si so bi bo in cs us sy id wa st
>> 1 0 0 65370828 29244 106088 0 0 0 0 66 26 0 0 100 0 0
>> 1 0 0 65370828 29244 106088 0 0 0 0 103 66 0 0 100 0 0
>> 1 0 0 65370828 29244 106088 0 0 0 0 34 12 0 0 100 0 0
>> 1 0 0 65370828 29244 106088 0 0 0 0 25 12 0 0 100 0 0
>> 1 0 0 65370828 29244 106088 0 0 0 0 28 14 0 0 100 0 0
>>
>> we're idling at only a few interrupts per second, which isn't bad for
>> a 24 CPU toy.
>>
>> * With this patch:
>>
>> maz@...-leg-emma:~$ vmstat -y 1
>> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> r b swpd free buff cache si so bi bo in cs us sy id wa st
>> 1 0 0 65361024 28420 105388 0 0 0 0 3710 27 0 0 100 0 0
>> 1 0 0 65361024 28420 105388 0 0 0 0 3399 20 0 0 100 0 0
>> 1 0 0 65361024 28420 105388 0 0 0 0 4439 78 0 0 100 0 0
>> 1 0 0 65361024 28420 105388 0 0 0 0 5634 14 0 0 100 0 0
>> 1 0 0 65361024 28420 105388 0 0 0 0 5575 14 0 0 100 0 0
>>
>> we're idling at anywhere between 3k and 6k interrupts per second. Not
>> exactly what you want. This appears to be caused by the broadcast
>> timer IPI.
>>
>> Reverting this patch on top of 6.16 restores sanity on this machine.
>
> I don't know what is going on here, but it looks highly suspicious to me.
>
> The only effect of the change in question should be selecting a
> shallower idle state occasionally and why would this alone cause the
> number of wakeup interrupts to increase?
>
> Arguably, it might interfere with the tick stopping logic if
> predicted_ns happened to be less than TICK_NSEC sufficiently often,
> but that is not expected to happen on an idle system because in that
> case the average interval between genuine wakeups is relatively large.
> The tick itself is not counted as a wakeup event, so returning a
> shallower state at one point shouldn't affect future predictions, but
> the data above suggests that it actually does affect them.
>
> It looks like selecting a shallower idle state by the governor at one
> point causes more wakeup interrupts to occur in the future, which is
> really not expected to happen.
>
> Christian, what do you think?
>
So I looked into it a bit; unfortunately I don't have the box in question.
Apparently psci cpuidle is broken (and never worked), it has:
/sys/devices/system/cpu/cpu0/cpuidle/driver/name: psci_idle
/sys/devices/system/cpu/cpu0/cpuidle/state0/disable: 0
/sys/devices/system/cpu/cpu0/cpuidle/state0/above: 0
/sys/devices/system/cpu/cpu0/cpuidle/state0/time: 1206224
/sys/devices/system/cpu/cpu0/cpuidle/state0/rejected: 0
/sys/devices/system/cpu/cpu0/cpuidle/state0/power: 4294967295
/sys/devices/system/cpu/cpu0/cpuidle/state0/residency: 1
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency: 1
/sys/devices/system/cpu/cpu0/cpuidle/state0/usage: 1447
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc: ARM WFI
/sys/devices/system/cpu/cpu0/cpuidle/state0/below: 182
/sys/devices/system/cpu/cpu0/cpuidle/state0/default_status: enabled
/sys/devices/system/cpu/cpu0/cpuidle/state0/name: WFI
/sys/devices/system/cpu/cpu0/cpuidle/state1/disable: 0
/sys/devices/system/cpu/cpu0/cpuidle/state1/above: 0
/sys/devices/system/cpu/cpu0/cpuidle/state1/time: 0
/sys/devices/system/cpu/cpu0/cpuidle/state1/rejected: 17021
/sys/devices/system/cpu/cpu0/cpuidle/state1/power: 0
/sys/devices/system/cpu/cpu0/cpuidle/state1/residency: 2000
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency: 1500
/sys/devices/system/cpu/cpu0/cpuidle/state1/usage: 0
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc: cpu-sleep-0
/sys/devices/system/cpu/cpu0/cpuidle/state1/below: 0
/sys/devices/system/cpu/cpu0/cpuidle/state1/default_status: enabled
/sys/devices/system/cpu/cpu0/cpuidle/state1/name: cpu-sleep-0
/sys/devices/system/cpu/cpu0/cpuidle/state1/s2idle/time: 0
/sys/devices/system/cpu/cpu0/cpuidle/state1/s2idle/usage: 0
/sys/devices/system/cpu/cpu0/cpuidle/state2/disable: 0
/sys/devices/system/cpu/cpu0/cpuidle/state2/above: 0
/sys/devices/system/cpu/cpu0/cpuidle/state2/time: 0
/sys/devices/system/cpu/cpu0/cpuidle/state2/rejected: 19598056
/sys/devices/system/cpu/cpu0/cpuidle/state2/power: 0
/sys/devices/system/cpu/cpu0/cpuidle/state2/residency: 2500
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency: 1600
/sys/devices/system/cpu/cpu0/cpuidle/state2/usage: 0
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc: cluster-sleep-0
/sys/devices/system/cpu/cpu0/cpuidle/state2/below: 0
/sys/devices/system/cpu/cpu0/cpuidle/state2/default_status: enabled
/sys/devices/system/cpu/cpu0/cpuidle/state2/name: cluster-sleep-0
/sys/devices/system/cpu/cpu0/cpuidle/state2/s2idle/time: 0
/sys/devices/system/cpu/cpu0/cpuidle/state2/s2idle/usage: 0
(similar on all CPUs), i.e. it never accounts any non-WFI cpuidle state
as entered, only as rejected. That is the case both with and without this
patch, but with the patch the rejected count is bigger by a factor of 10
or so, which is what Marc is reporting and complaining about.
If a cpuidle state entry fails, it is reported as 0 idle residency in
cpuidle.c, which indeed makes these statistical tests bogus.
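To illustrate what I mean by bogus, here is a rough user-space
approximation of the kind of averaging/variance test the governor does
over its recent-interval buffer (NOT the actual menu.c code, and the
numbers are made up): a history where most slots are 0us because the
entries were rejected never looks low-variance, so the fallback
prediction ends up tracking the one real sample instead of the distance
to the next timer.

/*
 * Rough user-space approximation of the statistics the menu governor
 * computes over its recent-interval buffer; not the actual menu.c code,
 * just to illustrate the effect of rejected entries being recorded as
 * 0us of idle residency.
 */
#include <stdio.h>

#define INTERVALS 8

static unsigned int predict(const unsigned int *intervals)
{
	unsigned long long avg = 0, var = 0;
	unsigned int max = 0;
	int i;

	for (i = 0; i < INTERVALS; i++) {
		avg += intervals[i];
		var += (unsigned long long)intervals[i] * intervals[i];
		if (intervals[i] > max)
			max = intervals[i];
	}
	avg /= INTERVALS;
	var = var / INTERVALS - avg * avg;

	/* "typical interval"-style test: trust avg if the spread is small */
	if (var <= 400 || avg * avg > 36 * var)
		return (unsigned int)avg;

	return max;	/* roughly what the patched fallback path returns */
}

int main(void)
{
	/* one real ~10ms wakeup, seven rejected entries recorded as 0us */
	unsigned int mostly_rejected[INTERVALS] = { 10000, 0, 0, 0, 0, 0, 0, 0 };
	/* a genuinely idle CPU waking up roughly once a second */
	unsigned int genuinely_idle[INTERVALS] = {
		950000, 990000, 970000, 960000, 980000, 955000, 965000, 975000
	};

	printf("mostly rejected: %u us\n", predict(mostly_rejected));
	printf("genuinely idle:  %u us\n", predict(genuinely_idle));
	return 0;
}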
I guess we could add a safeguard here again that at least requires most
of the values to be != 0?
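Something along these lines, purely as a sketch (hypothetical helper
name, untested, and obviously it only treats the symptom):

/*
 * Only trust the recent-interval statistics if most of the samples
 * correspond to idle periods that actually happened, i.e. were not
 * recorded as 0 because the state entry was rejected.
 */
static bool intervals_mostly_valid(const unsigned int *intervals, int nr)
{
	int i, nonzero = 0;

	for (i = 0; i < nr; i++)
		if (intervals[i])
			nonzero++;

	/* require at least half of the samples to be real idle periods */
	return nonzero * 2 >= nr;
}

and have get_typical_interval() fall back to the timer-based prediction
(i.e. return UINT_MAX) when that check fails, before doing any of the
variance math.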
The real issue is that cpuidle is being used while it's clearly not
working as expected, so we could also bake some sanity checks into the
cpuidle core.
I'd be curious whether this is more common or whether this is really the
only platform where this occurs (and it went unnoticed all this time).