Message-ID: <CAKfTPtDXRKt8zOe7XTG8L037myS4DBr+4FXfLEeF2Ai42=s+8g@mail.gmail.com>
Date: Mon, 4 Aug 2025 15:41:31 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Christian Loehle <christian.loehle@....com>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>, "Rafael J . Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>, Sudeep Holla <sudeep.holla@....com>, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org, Robin Murphy <robin.murphy@....com>,
Beata Michalska <beata.michalska@....com>, zhenglifeng1@...wei.com,
Ionela Voinescu <ionela.voinescu@....com>
Subject: Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost

On Mon, 4 Aug 2025 at 15:18, Christian Loehle <christian.loehle@....com> wrote:
>
> On 8/4/25 14:01, Vincent Guittot wrote:
> > On Mon, 14 Jul 2025 at 14:17, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
> >>
> >> +cc Vincent Guittot <vincent.guittot@...aro.org>
> >> +cc Ionela Voinescu <ionela.voinescu@....com>
> >>
> >> On 26/06/2025 11:30, Dietmar Eggemann wrote:
> >>> I noticed on my Arm64 big.LITTLE platform (Juno-r0, scmi-cpufreq) that
> >>> the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
> >>> little CPU changed in v6.14 from 446 to 505. I bisected and found that
> >>> commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
> >>> default per-policy boost flag") (1) introduced this change.
> >>> Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
> >>> 775000 850000, policy1: 950000 1100000) as boost OPPs.
> >>>
> >>> The reason is that the 'policy->boost_enabled = true' is now done after
> >>> 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
> >>> in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
> >>> 'highest non-boost' instead of the 'highest boost' frequency.
> >>>
> >>> This is before the CPUFREQ_CREATE_POLICY notifier is fired in
> >>> cpufreq_online() to which the cpu_capacity setup code in
> >>> [drivers/base/arch_topology.c] has registered.
> >>>
> >>> Its notifier_call init_cpu_capacity_callback() uses
> >>> 'policy->cpuinfo.max_freq' to set the per-cpu
> >>> capacity_freq_ref so that the cpu_capacity can be calculated as:
> >>>
> >>> cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
> >>> 'max system-wide cpu frequency'
> >>>
> >>> (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
> >>>
> >>> So before (1) for a little CPU:
> >>>
> >>> cpu_capacity = 578 * 850000 / 1100000 = 446
> >>>
> >>> and after:
> >>>
> >>> cpu_capacity = 578 * 700000 / 800000 = 505
> >>>
> >>> This issue can also be seen on Arm64 boards with cpufreq-dt drivers
> >>> using the 'turbo-mode' dt property for boosted OPPs.
> >>>
> >>> What's actually needed IMHO is to calculate cpu_capacity according to
> >>> the boost value. I.e.:
> >>>
> >>> (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
> >>> be kept alive after boot.
> >
> > If we adjust the cpu_capacity at runtime, this will create oscillation
> > in PELT values. We should stay with one single capacity all the time:
> > - either include the boost value, but then when boost is disabled we will
> > never reach the max capacity of the CPU, which could imply that the CPU
> > will never be overloaded (from the scheduler's point of view)
>
> overutilized I'm assuming, that's the issue I was worried about here.

No, I was referring to group_is_overloaded(), which uses the sum of the
CPUs' capacities. That's also true for EAS and overutilized.

> Strictly speaking the platform doesn't guarantee that the capacity can
> be reached and sustained indefinitely, whether the frequency is marked
> as boost or not.

Regarding thermal mitigation and user max freq, we take that into account.

>
> > - or do not include the boost value but allow going above the CPU's max
> > compute capacity, which is something we already discussed for x86 and
> > turbo frequencies in the past.
> >
>
> But that currently breaks schedutil, i.e. boost frequencies will never
> be used with schedutil. There's also some other locations where capacities

We should allow capacity to go above 1024 to reflect HW reality with
turbo and, here, enabling/disabling boost.

> >1024 just break some assumptions (e.g. the kernel/sched/ext.c cpuperf
> interface defines SCX_CPUPERF_ONE).
>
>
> So we have either:
> a) Potentially wrong capacity estimation of CPUs when boost is disabled
> (but the capacity calculation assumed it was enabled).
> b) Boost frequencies completely unused by schedutil.
> c) Oscillating PELT values due to boost enable/disable.
>
> Isn't c) (what Dietmar proposed here) by far the smallest evil of these
> three?

No, this breaks PELT invariance.

> I've also found a) very hard to actually trigger, although it's obviously
> a problem that depends on the platform.
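
For reference, the capacity arithmetic quoted earlier in the thread can be
reproduced with a small sketch. It follows the simplified formula from the
quoted message, not the actual code in drivers/base/arch_topology.c. The
function name cpu_capacity() and the use of truncating integer division are
assumptions, chosen because they match the 446 and 505 values reported for
the Juno little CPU.

```python
def cpu_capacity(raw_cpu_capacity, capacity_freq_ref, max_system_freq):
    """Simplified formula from the quoted message (hypothetical helper):
    raw_cpu_capacity * capacity_freq_ref / max system-wide CPU frequency.
    Truncating integer division is an assumption."""
    return raw_cpu_capacity * capacity_freq_ref // max_system_freq


# Juno little CPU: capacity-dmips-mhz = <578> (from the quoted message).
LITTLE_RAW_CAPACITY = 578

# Before commit dd016f379ebc: cpuinfo.max_freq includes the boost OPPs.
before = cpu_capacity(LITTLE_RAW_CAPACITY, 850000, 1100000)

# After: cpuinfo.max_freq is the highest non-boost OPP of each policy.
after = cpu_capacity(LITTLE_RAW_CAPACITY, 700000, 800000)

print(before, after)
```

The jump from 446 to 505 comes only from which frequencies are taken as the
reference; the raw capacity of the little CPU is the same in both cases.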