Message-ID: <CAKfTPtAwy1ZFQ=-t7SbbDuHj6ZJPtB3pJS6fZxt=1robLwvXjg@mail.gmail.com>
Date: Mon, 4 Aug 2025 15:01:10 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: "Rafael J . Wysocki" <rafael@...nel.org>, Viresh Kumar <viresh.kumar@...aro.org>,
Sudeep Holla <sudeep.holla@....com>, Christian Loehle <christian.loehle@....com>,
linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
Robin Murphy <robin.murphy@....com>, Beata Michalska <beata.michalska@....com>, zhenglifeng1@...wei.com,
Ionela Voinescu <ionela.voinescu@....com>
Subject: Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity
according to boost
On Mon, 14 Jul 2025 at 14:17, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>
> +cc Vincent Guittot <vincent.guittot@...aro.org>
> +cc Ionela Voinescu <ionela.voinescu@....com>
>
> On 26/06/2025 11:30, Dietmar Eggemann wrote:
> > I noticed on my Arm64 big.LITTLE platform (Juno-r0, scmi-cpufreq) that
> > the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
> > little CPU changed in v6.14 from 446 to 505. I bisected and found that
> > commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
> > default per-policy boost flag") (1) introduced this change.
> > Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
> > 775000 850000, policy1: 950000 1100000) as boost OPPs.
> >
> > The reason is that the 'policy->boost_enabled = true' is now done after
> > 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
> > in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
> > 'highest non-boost' instead of the 'highest boost' frequency.
> >
> > This is before the CPUFREQ_CREATE_POLICY notifier is fired in
> > cpufreq_online() to which the cpu_capacity setup code in
> > [drivers/base/arch_topology.c] has registered.
> >
> > Its notifier_call init_cpu_capacity_callback() uses
> > 'policy->cpuinfo.max_freq' to set the per-cpu
> > capacity_freq_ref so that the cpu_capacity can be calculated as:
> >
> > cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
> > 'max system-wide cpu frequency'
> >
> > (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
> >
> > So before (1) for a little CPU:
> >
> > cpu_capacity = 578 * 850000 / 1100000 = 446
> >
> > and after:
> >
> > cpu_capacity = 578 * 700000 / 800000 = 505
> >
> > This issue can also be seen on Arm64 boards with cpufreq-dt drivers
> > using the 'turbo-mode' dt property for boosted OPPs.
> >
> > What's actually needed IMHO is to calculate cpu_capacity according to
> > the boost value. I.e.:
> >
> > (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
> > be kept alive after boot.
If we adjust the cpu_capacity at runtime, this will create oscillation
in PELT values. We should stay with one single capacity all the time:
- either include the boost value, but then when boost is disabled we will
never reach the max capacity of the cpu, which could imply that the cpu
will never be overloaded (from the scheduler's pov)
- or not include the boost value, but allow going above the cpu's max
compute capacity, which is something we already discussed for x86 and
the turbo freq in the past.
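
To put rough numbers on the two options for Juno's little CPU, here is a
small user-space sketch (not kernel code). The helper names are
hypothetical, and the second helper assumes that the compute capacity
reachable at a given frequency scales linearly with that frequency.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Formula quoted in this thread (integer division):
 * cpu_capacity = raw_cpu_capacity * capacity_freq_ref / max system-wide freq
 */
static inline uint64_t calc_cpu_capacity(uint64_t raw_capacity,
					 uint64_t freq_ref_khz,
					 uint64_t sys_max_khz)
{
	assert(sys_max_khz != 0);
	return raw_capacity * freq_ref_khz / sys_max_khz;
}

/*
 * Assumption for illustration only: the capacity reachable at cur_khz
 * scales linearly with frequency relative to the reference frequency.
 */
static inline uint64_t capacity_at_freq(uint64_t capacity,
					uint64_t cur_khz,
					uint64_t freq_ref_khz)
{
	assert(freq_ref_khz != 0);
	return capacity * cur_khz / freq_ref_khz;
}
```

With the values from this thread, calc_cpu_capacity(578, 850000, 1100000)
gives 446 (boost included) and calc_cpu_capacity(578, 700000, 800000)
gives 505 (boost not included). Under the linear-scaling assumption, the
first option with boost disabled tops out at
capacity_at_freq(446, 700000, 850000) = 367, while the second option with
boost enabled can reach capacity_at_freq(505, 850000, 700000) = 613.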
> >
> > (b) There has to be some kind of notification from cpufreq.c to
> > arch_topology.c about the toggling of boost. I'm abusing
> > CPUFREQ_CREATE_POLICY for this right now. Could we perhaps add a
> > CPUFREQ_MOD_POLICY for this?
> >
> > (c) Allow unconditional set of policy->cpuinfo.max_freq in case boost
> > is set to 0 in cpufreq_frequency_table_cpuinfo().
> > This currently clashes with the commented feature that in case the
> > driver has set a higher value it should stay untouched.
> >
> > Tested on Arm64 Juno (scmi-cpufreq) and Hikey 960 (cpufreq-dt +
> > added 'turbo-mode' to the topmost OPPs in dts file).
> >
> > This is probably related to what Christian Loehle tried to address in
> > https://lkml.kernel.org/r/3cc5b83b-f81c-4bd7-b7ff-4d02db4e25d8@arm.com .
>
> Christian L. reminded me that since commit dd016f379ebc we also have a
> performance regression on a system with boosted OPPs using schedutil
> CPUfreq governor.
>
> The reason is that per cpu 'capacity_freq_ref' is set in
> drivers/base/arch_topology.c only during system boot so far based on the
> highest non-boosted OPP since boost is disabled by default.
>
> Schedutil uses capacity_freq_ref (*) in get_next_freq() to calculate the
> next frequency request:
>
>   next_freq = max_freq * util / max
>               ^^^^^^^^
>                 (*)
>
> If the boost OPPs are then enabled:
>
> echo 1 > /sys/devices/system/cpu/cpufreq/boost
>
> 'capacity_freq_ref' stays at the highest non-boosted OPP, so schedutil
> won't request any boosted OPPs for util values > 'highest non-boosted
> OPP' / 'highest boosted OPP' * max. The highest non-boosted OPP will be
> used by schedutil instead.
>
> This performance regression will go away with the proposed patch as well.
>
> Calling drivers/base/arch_topology.c's init_cpu_capacity_callback() in
> the event that boost is toggled makes sure that 'capacity_freq_ref' will
> be set to the highest boosted (0->1) or highest non-boosted (1->0) OPP.
>
> [...]
>
>
>
>
>
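
As an illustration of the schedutil point quoted above, the following
user-space sketch applies only the quoted formula
'next_freq = max_freq * util / max'. It is a simplification: the real
get_next_freq() also applies headroom and resolves the result against the
frequency table, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified frequency request: reference frequency scaled by util / max.
 * freq_ref_khz stands in for the per-cpu capacity_freq_ref.
 */
static inline uint64_t sketch_next_freq(uint64_t freq_ref_khz,
					uint64_t util, uint64_t max_cap)
{
	assert(max_cap != 0);
	return freq_ref_khz * util / max_cap;
}
```

For Juno's little CPU with max = 505: if capacity_freq_ref stays at 700000,
even util == max only yields sketch_next_freq(700000, 505, 505) = 700000,
so no boosted OPP is ever requested. With capacity_freq_ref at 850000, the
same util yields 850000, and sketch_next_freq(850000, 450, 505) = 757425
already lies above the highest non-boosted OPP.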