Message-ID: <16b728e6-6fb9-48eb-8160-73c4ace229d2@arm.com>
Date: Mon, 4 Aug 2025 14:18:22 +0100
From: Christian Loehle <christian.loehle@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>
Cc: "Rafael J . Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>, Sudeep Holla <sudeep.holla@....com>,
linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
Robin Murphy <robin.murphy@....com>,
Beata Michalska <beata.michalska@....com>, zhenglifeng1@...wei.com,
Ionela Voinescu <ionela.voinescu@....com>
Subject: Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity
according to boost
On 8/4/25 14:01, Vincent Guittot wrote:
> On Mon, 14 Jul 2025 at 14:17, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>>
>> +cc Vincent Guittot <vincent.guittot@...aro.org>
>> +cc Ionela Voinescu <ionela.voinescu@....com>
>>
>> On 26/06/2025 11:30, Dietmar Eggemann wrote:
>>> I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
>>> the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
>>> little CPU changed in v6.14 from 446 to 505. I bisected and found that
>>> commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
>>> default per-policy boost flag") (1) introduced this change.
>>> Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
>>> 775000 850000, policy1: 950000 1100000) as boost OPPs.
>>>
>>> The reason is that the 'policy->boost_enabled = true' is now done after
>>> 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
>>> in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
>>> 'highest non-boost' instead of the 'highest boost' frequency.
>>>
>>> This is before the CPUFREQ_CREATE_POLICY notifier is fired in
>>> cpufreq_online() to which the cpu_capacity setup code in
>>> [drivers/base/arch_topology.c] has registered.
>>>
>>> Its notifier_call init_cpu_capacity_callback() uses
>>> 'policy->cpuinfo.max_freq' to set the per-cpu
>>> capacity_freq_ref so that the cpu_capacity can be calculated as:
>>>
>>> cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
>>> 'max system-wide cpu frequency'
>>>
>>> (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
>>>
>>> So before (1) for a little CPU:
>>>
>>> cpu_capacity = 578 * 850000 / 1100000 = 446
>>>
>>> and after:
>>>
>>> cpu_capacity = 578 * 700000 / 800000 = 505
>>>
>>> This issue can also be seen on Arm64 boards with cpufreq-dt drivers
>>> using the 'turbo-mode' dt property for boosted OPPs.
>>>
>>> What's actually needed IMHO is to calculate cpu_capacity according to
>>> the boost value. I.e.:
>>>
>>> (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
>>> be kept alive after boot.
>
> If we adjust the cpu_capacity at runtime this will create oscillation
> in PELT values. We should stick with one single capacity at all times:
> - either include the boost value, but when boost is disabled we will
> never reach the max capacity of the CPU, which could imply that the CPU
> will never be overloaded (from the scheduler's PoV)
Overutilized, I'm assuming; that's the issue I was worried about here.
Strictly speaking, the platform doesn't guarantee that the capacity can
be reached and sustained indefinitely, whether the frequency is marked
as boost or not.
> - or not include the boost value but allow going above the CPU's max
> compute capacity, which is something we already discussed for x86 and
> turbo frequencies in the past.
>
But that currently breaks schedutil, i.e. boost frequencies will never
be used with schedutil. There are also other places where capacities
>1024 break some assumptions (e.g. the kernel/sched/ext.c cpuperf
interface defines SCX_CPUPERF_ONE).
So we have either:
a) Potentially wrong capacity estimation of CPUs when boost is disabled
(but the capacity calculation assumed it enabled).
b) Boost frequencies completely unused by schedutil.
c) Oscillating PELT values due to boost enable/disable.
Isn't c) (what Dietmar proposed here) by far the smallest evil of these
three?
I've also found a) very hard to actually trigger, although it's obviously
a problem that depends on the platform.