linux-kernel - Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <8252b1e6-5d13-4f26-8aa3-30e841639e10@os.amperecomputing.com>
Date: Tue, 5 Aug 2025 17:26:33 -0700
From: Yang Shi <yang@...amperecomputing.com>
To: Prashant Malani <pmalani@...gle.com>,
 open list <linux-kernel@...r.kernel.org>,
 "open list:CPU FREQUENCY SCALING FRAMEWORK" <linux-pm@...r.kernel.org>,
 "Rafael J. Wysocki" <rafael@...nel.org>,
 Viresh Kumar <viresh.kumar@...aro.org>, beata.michalska@....com,
 Catalin Marinas <catalin.marinas@....com>
Cc: Ionela Voinescu <Ionela.Voinescu@....com>
Subject: Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads



On 7/30/25 3:07 PM, Prashant Malani wrote:
> [You don't often get email from pmalani@...gle.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> On a heavily loaded CPU, performance counter reads can be erratic. This is
> due to two factors:
> - The method used to calculate CPPC delivered performance.
> - Linux scheduler vagaries.
>
> As an example, on a CPU which has a max frequency of 3.4 GHz, if we run
> stress-ng on the CPU in the background and then read the frequency, we get
> invalid readings:
>
> ./stress_ng --cpu 108 --taskset 3 -t 30s &
> cat /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_cur_freq
> 3600000
> 3500000
> 3600000
>
> Per [1] CPPC performance is measured by reading the delivered and reference
> counters at timestamp t0, then waiting 2us, and then repeating the
> measurement at t1. So, in theory, one should end up with:
> Timestamp t0: ref0, del0
> Timestamp t1: ref1, del1
>
> However, since the reference and delivered registers are individual
> register reads (in the case of FFH, it even results in an IPI to the CPU in
> question), what happens in practice is:
> Timestamp t0: del0
> Timestamp t0 + m: ref0
> Timestamp t1: del1
> Timestamp t1 + n: ref1
>
> There has been prior discussion[2] about the cause of these differences;
> it was broadly pegged as due to IRQs and "interconnect congestion".
>
> Since the gap between t0 and t1 is very small (2us), differing values of m
> and n mean that the measurements don't correspond to 2 discrete timestamps,
> since the delivered performance delta is being measured across a
> significantly different time period than the reference performance
> delta. This has an influence on the perf measurement which is:
> ((del1 - del0) * reference perf) / (ref1 - ref0)
>
> Previously collected data[4] shows that cppc_get_perf_ctrs() itself
> takes anywhere between 4.9us and 3.6us, which further suggests that a
> 2us delta is too less.
>
> If we increase the time delta to a high enough value (i.e if delay >> m,n)
> then the effects of m and n get mitigated, leading to both the register
> measurements (ref and del) corresponding to the same timestamp.
>
> When this approach was previously proposed[3], there was concern about
> this function being called with interrupts off but that was later found to
> be not true [2]. So, waiting for a slightly longer time in between counter
> samples should be acceptable.
>
> Increase the time delay between counter reads to 200 us to reduce the
> effect of timing discrepancies in reading individual performance registers.
>
> [1] https://docs.kernel.org/admin-guide/acpi/cppc_sysfs.html#computing-average-delivered-performance
> [2] https://lore.kernel.org/all/7b57e680-0ba3-0b8b-851e-7cc369050386@os.amperecomputing.com/
> [3] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
> [4] https://lore.kernel.org/all/1ce09fd7-0c1d-fc46-ce12-01b25fbd4afd@os.amperecomputing.com/
>
> Cc: Yang Shi <yang@...amperecomputing.com>
> Cc: Ionela Voinescu <Ionela.Voinescu@....com>
> Signed-off-by: Prashant Malani <pmalani@...gle.com>

Thank you for cc'ing me the patch. I posted the similar patch ago and 
had some discussion on the mailing list. Then someone else from ARM 
pursued a different way to solve it. But I didn't follow very closely. 
If I remember correctly, a new sysfs interface, called cpuinfo_avg_freq 
was added. It should be the preferred way to get cpu frequency. Please 
see 
https://github.com/torvalds/linux/commit/fbb4a4759b541d09ebb8e391d5fa7f9a5a0cad61.

Added Beata Michalska in the loop too, who is the author of the patch. 
Please feel free to correct me, if I'm wrong.

Yang


> ---
>   drivers/cpufreq/cppc_cpufreq.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index 4a17162a392d..086c3b87bd4e 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -718,7 +718,7 @@ static int cppc_get_perf_ctrs_sample(int cpu,
>          if (ret)
>                  return ret;
>
> -       udelay(2); /* 2usec delay between sampling */
> +       udelay(200); /* 200usec delay between sampling */
>
>          return cppc_get_perf_ctrs(cpu, fb_ctrs_t1);
>   }
> --
> 2.50.1.552.g942d659e1b-goog
>