Message-ID: <20250730220812.53098-1-pmalani@google.com>
Date: Wed, 30 Jul 2025 22:07:58 +0000
From: Prashant Malani <pmalani@...gle.com>
To: open list <linux-kernel@...r.kernel.org>,
"open list:CPU FREQUENCY SCALING FRAMEWORK" <linux-pm@...r.kernel.org>, "Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>
Cc: Prashant Malani <pmalani@...gle.com>, Yang Shi <yang@...amperecomputing.com>,
Ionela Voinescu <Ionela.Voinescu@....com>
Subject: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
On a heavily loaded CPU, performance counter reads can be erratic. This is
due to two factors:
- The method used to calculate CPPC delivered performance.
- Linux scheduler vagaries.
As an example, on a CPU with a maximum frequency of 3.4 GHz, if we run
stress-ng on the CPU in the background and then read the frequency, we get
invalid readings that exceed that maximum:
./stress_ng --cpu 108 --taskset 3 -t 30s &
cat /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_cur_freq
3600000
3500000
3600000
Per [1] CPPC performance is measured by reading the delivered and reference
counters at timestamp t0, then waiting 2us, and then repeating the
measurement at t1. So, in theory, one should end up with:
Timestamp t0: ref0, del0
Timestamp t1: ref1, del1
However, since the reference and delivered registers are individual
register reads (in the case of FFH, it even results in an IPI to the CPU in
question), what happens in practice is:
Timestamp t0: del0
Timestamp t0 + m: ref0
Timestamp t1: del1
Timestamp t1 + n: ref1
There has been prior discussion[2] about the cause of these differences;
it was broadly pegged as due to IRQs and "interconnect congestion".
Because the gap between t0 and t1 is very small (2us), differing values of
m and n mean that the measurements don't correspond to 2 discrete
timestamps: the delivered performance delta is measured across a
significantly different time period than the reference performance
delta. This skews the perf measurement, which is computed as:
((del1 - del0) * reference perf) / (ref1 - ref0)
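To make the formula concrete, here is a minimal userspace sketch. It is
not the kernel's implementation: delivered_perf() is a hypothetical
helper, the counter values are made up, and returning 0 when the
reference counter did not advance is an assumption made only to avoid
dividing by zero.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the formula above:
 *   ((del1 - del0) * reference perf) / (ref1 - ref0)
 * Returns 0 if the reference counter did not advance (an assumption
 * for this sketch, not documented kernel behaviour).
 */
static uint64_t delivered_perf(uint64_t reference_perf,
                               uint64_t del0, uint64_t del1,
                               uint64_t ref0, uint64_t ref1)
{
    uint64_t delta_del = del1 - del0;
    uint64_t delta_ref = ref1 - ref0;

    if (delta_ref == 0)
        return 0;

    return (delta_del * reference_perf) / delta_ref;
}
```

With made-up numbers: if the delivered counter advances by 680 and the
reference counter by 200 over the same window, and reference perf is
100, the result is 340. If the reference delta is instead sampled over
a shorter window and only advances by 150, the same delivered delta
yields 453, an overestimate of roughly 33%.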
Previously collected data[4] shows that cppc_get_perf_ctrs() itself
takes anywhere between 3.6us and 4.9us, which further suggests that a
2us delta is too small.
If we increase the time delta to a high enough value (i.e., if
delay >> m, n), then the effects of m and n are mitigated, so that both
register measurements (ref and del) correspond to approximately the same
timestamp.
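The effect of the delay can be illustrated with a toy model. This is a
sketch under stated assumptions, not kernel code: both counters are
assumed to tick at constant rates, the delivered counter is read at t0
and t0 + delay, the reference counter at t0 + m and t0 + delay + n, and
skew_ratio() is a hypothetical name. Under those assumptions the ratio
of measured to true perf is delay / (delay + n - m).

```c
/*
 * Toy model (assumption): ratio of measured perf to true perf when the
 * delivered delta spans delay_us and the reference delta spans
 * delay_us + n_us - m_us. The caller must keep that window positive.
 */
static double skew_ratio(double delay_us, double m_us, double n_us)
{
    double del_window = delay_us;
    double ref_window = delay_us + n_us - m_us;

    return del_window / ref_window;
}
```

For illustrative values m = 1us and n = 0.5us, a 2us delay gives a
ratio of about 1.33, while a 200us delay gives about 1.0025. When
m == n the skew cancels and the ratio is exactly 1.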
When this approach was previously proposed[3], there was concern that
this function could be called with interrupts off, but that was later
found not to be the case[2]. So, waiting slightly longer between counter
samples should be acceptable.
Increase the time delay between counter reads to 200 us to reduce the
effect of timing discrepancies in reading individual performance registers.
[1] https://docs.kernel.org/admin-guide/acpi/cppc_sysfs.html#computing-average-delivered-performance
[2] https://lore.kernel.org/all/7b57e680-0ba3-0b8b-851e-7cc369050386@os.amperecomputing.com/
[3] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
[4] https://lore.kernel.org/all/1ce09fd7-0c1d-fc46-ce12-01b25fbd4afd@os.amperecomputing.com/
Cc: Yang Shi <yang@...amperecomputing.com>
Cc: Ionela Voinescu <Ionela.Voinescu@....com>
Signed-off-by: Prashant Malani <pmalani@...gle.com>
---
drivers/cpufreq/cppc_cpufreq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 4a17162a392d..086c3b87bd4e 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -718,7 +718,7 @@ static int cppc_get_perf_ctrs_sample(int cpu,
 	if (ret)
 		return ret;
 
-	udelay(2); /* 2usec delay between sampling */
+	udelay(200); /* 200usec delay between sampling */
 
 	return cppc_get_perf_ctrs(cpu, fb_ctrs_t1);
 }
--
2.50.1.552.g942d659e1b-goog