[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJvTdK=_v9q2eGMB6qG3iaDhXMzQHz-EJ4NeDEfBe2fbv+wKfQ@mail.gmail.com>
Date: Tue, 25 Nov 2025 13:13:54 -0500
From: Len Brown <lenb@...nel.org>
To: Emily Ehlert <ehemily@...zon.de>
Cc: linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
Emily Ehlert <ehemily@...zon.com>
Subject: Re: [PATCH 2/2] tools/power/turbostat: Fix division by zero when TDP
calculation fails
Can you share what system configuration went down this error path?
I don't have a good feeling about this remedy.
The reason we get TDP is to estimate if RAPL will overflow its
undersized energy counters during the measurement interval. We arm a
timer appropriately to accumulate RAPL if it looks like this will be
possible.
It is possible to run into that underlying issue, but it is not common usage.
So my question is if you have a system where RAPL works fine, but we
simply can't get TDP to build this safety net?
If so, it may make more sense to adjust the safety net, say, using
some conservative default timer interval, rather than not report RAPL
at all.
And if we do opt to disable RAPL, I'd rather uniformly use the
existing build-in-counter macros for testing if a counter is actually
present and enabled instead of a new global flag.
thanks,
-Len
On Thu, Nov 13, 2025 at 2:17 PM Emily Ehlert <ehemily@...zon.de> wrote:
>
> From: Emily Ehlert <ehemily@...zon.com>
>
> turbostat uses hard coded features for CPU families and expects access to
> RAPL (Running Average Power Limit) MSRs. When RAPL or power info is not
> available, turbostat reads PKG_POWER_INFO MSR to calculate TDP with
> RAPL_POWER_UNIT MSR. If TDP calculation results in 0, no zero check is
> performed and the 0 TDP is used in division, resulting in an invalid
> rapl_joule_counter_range. This variable is later used in msr_sum_record()
> as a timer parameter to timer_settime() syscall, causing issues.
>
> Fix the issue by:
>
> - Introduce zero check for tdp in rapl_probe_intel() and rapl_probe_amd()
> - Introduce global variable no_rapl which is set to true if zero check fails
> - Skip RAPL-dependent functions when no_rapl is true
> - Add assertions and guards to prevent RAPL operations when disabled
>
> Signed-off-by: Emily Ehlert <ehemily@...zon.com>
> ---
> tools/power/x86/turbostat/turbostat.c | 27 ++++++++++++++++++++++++---
> 1 file changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> index 9a2be201a3a6..9c6ee0acbe12 100644
> --- a/tools/power/x86/turbostat/turbostat.c
> +++ b/tools/power/x86/turbostat/turbostat.c
> @@ -531,6 +531,7 @@ static struct timeval procsysfs_tv_begin;
> int ignore_stdin;
> bool no_msr;
> bool no_perf;
> +bool no_rapl;
>
> enum gfx_sysfs_idx {
> GFX_rc6,
> @@ -3119,6 +3120,10 @@ int dump_counters(PER_THREAD_PARAMS)
>
> double rapl_counter_get_value(const struct rapl_counter *c, enum rapl_unit desired_unit, double interval)
> {
> + if (no_rapl) {
> + return NAN;
> + }
> +
> assert(desired_unit != RAPL_UNIT_INVALID);
>
> /*
> @@ -4696,6 +4701,8 @@ static size_t cstate_counter_info_count_perf(const struct cstate_counter_info_t
>
> void write_rapl_counter(struct rapl_counter *rc, struct rapl_counter_info_t *rci, unsigned int idx)
> {
> + assert(!no_rapl);
> +
> if (rci->source[idx] == COUNTER_SOURCE_NONE)
> return;
>
> @@ -4706,6 +4713,8 @@ void write_rapl_counter(struct rapl_counter *rc, struct rapl_counter_info_t *rci
>
> int get_rapl_counters(int cpu, unsigned int domain, struct core_data *c, struct pkg_data *p)
> {
> + assert(!no_rapl);
> +
> struct platform_counters *pplat_cnt = p == package_odd ? &platform_counters_odd : &platform_counters_even;
> unsigned long long perf_data[NUM_RAPL_COUNTERS + 1];
> struct rapl_counter_info_t *rci;
> @@ -5147,7 +5156,7 @@ int get_counters(PER_THREAD_PARAMS)
> if (!is_cpu_first_thread_in_core(t, c, p))
> goto done;
>
> - if (platform->has_per_core_rapl) {
> + if (platform->has_per_core_rapl && !no_rapl) {
> status = get_rapl_counters(cpu, get_rapl_domain_id(cpu), c, p);
> if (status != 0)
> return status;
> @@ -5213,7 +5222,7 @@ int get_counters(PER_THREAD_PARAMS)
> if (DO_BIC(BIC_SYS_LPI))
> p->sys_lpi = cpuidle_cur_sys_lpi_us;
>
> - if (!platform->has_per_core_rapl) {
> + if (!platform->has_per_core_rapl && !no_rapl) {
> status = get_rapl_counters(cpu, get_rapl_domain_id(cpu), c, p);
> if (status != 0)
> return status;
> @@ -7650,6 +7659,12 @@ void rapl_probe_intel(void)
>
> tdp = get_tdp_intel();
>
> + if (tdp == 0.0) {
> + no_rapl = true;
> + fprintf(outf, "RAPL: Could not calculate TDP (TDP: %.0f, MSR_RAPL_POWER_UNIT: %llx)\n", tdp, msr);
> + return;
> + }
> +
> rapl_joule_counter_range = 0xFFFFFFFF * rapl_energy_units / tdp;
> if (!quiet)
> fprintf(outf, "RAPL: %.0f sec. Joule Counter Range, at %.0f Watts\n", rapl_joule_counter_range, tdp);
> @@ -7680,6 +7695,12 @@ void rapl_probe_amd(void)
>
> tdp = get_tdp_amd();
>
> + if (tdp == 0.0) {
> + no_rapl = true;
> + fprintf(outf, "RAPL: Could not calculate TDP (TDP: %.0f, MSR_RAPL_POWER_UNIT: %llx)\n", tdp, msr);
> + return;
> + }
> +
> rapl_joule_counter_range = 0xFFFFFFFF * rapl_energy_units / tdp;
> if (!quiet)
> fprintf(outf, "RAPL: %.0f sec. Joule Counter Range, at %.0f Watts\n", rapl_joule_counter_range, tdp);
> @@ -11215,7 +11236,7 @@ int main(int argc, char **argv)
>
> turbostat_init();
>
> - if (!no_msr)
> + if (!no_msr && !no_rapl)
> msr_sum_record();
>
> /* dump counters and exit */
> --
> 2.47.3
>
>
>
>
> Amazon Web Services Development Center Germany GmbH
> Tamara-Danz-Str. 13
> 10243 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Christof Hellmis
> Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
> Sitz: Berlin
> Ust-ID: DE 365 538 597
>
--
Len Brown, Intel Open Source Technology Center
Powered by blists - more mailing lists