Message-Id: <20250817191825.ef254428d688d987333d4f4e@linux-foundation.org>
Date: Sun, 17 Aug 2025 19:18:25 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: yaozhenguo <yaozhenguo1@...il.com>
Cc: tglx@...utronix.de, yaoma@...ux.alibaba.com, max.kellermann@...os.com,
 lihuafei1@...wei.com, yaozhenguo@...com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] watchdog/softlockup:Fix incorrect CPU utilization
 output during softlockup

On Tue, 12 Aug 2025 16:25:10 +0800 yaozhenguo <yaozhenguo1@...il.com> wrote:

> From: ZhenguoYao <yaozhenguo1@...il.com>
>
> Since we use 16-bit precision, the raw data will undergo
> integer division, which may sometimes result in data loss.
> This can lead to slightly inaccurate CPU utilization calculations.
> Under normal circumstances, this isn’t an issue. However,
> when CPU utilization reaches 100%, the calculated result might
> exceed 100%. For example, with raw data like the following:
>
> sample_period 400000134 new_stat 83648414036 old_stat 83247417494
>
> sample_period=400000134/2^24=23
> new_stat=83648414036/2^24=4985
> old_stat=83247417494/2^24=4961
> util=105%
>
> Below log will output:
>
> CPU#3 Utilization every 0s during lockup:
> #1: 0% system, 0% softirq, 105% hardirq, 0% idle
> #2: 0% system, 0% softirq, 105% hardirq, 0% idle
> #3: 0% system, 0% softirq, 100% hardirq, 0% idle
> #4: 0% system, 0% softirq, 105% hardirq, 0% idle
> #5: 0% system, 0% softirq, 105% hardirq, 0% idle
>
> To avoid confusion, we enforce a 100% display cap when
> calculations exceed this threshold.
>
> ...
>
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -444,6 +444,13 @@ static void update_cpustat(void)
>  		old_stat = __this_cpu_read(cpustat_old[i]);
>  		new_stat = get_16bit_precision(cpustat[tracked_stats[i]]);
>  		util = DIV_ROUND_UP(100 * (new_stat - old_stat), sample_period_16);
> +		/* Since we use 16-bit precision, the raw data will undergo

	/*
	 * Since ...

please.

> +		 * integer division, which may sometimes result in data loss,
> +		 * and then result might exceed 100%. To avoid confusion,
> +		 * we enforce a 100% display cap when calculations exceed this threshold.
> +		 */
> +		if (util > 100)
> +			util = 100;
>  		__this_cpu_write(cpustat_util[tail][i], util);
>  		__this_cpu_write(cpustat_old[i], new_stat);
>  	}

Can we do something to make this output more accurate?  For example,

	return (data_ns + (1 << 23)) >> 24LL;

would round to the nearest multiple of 16.8ms?
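
To make the comparison concrete, here is a minimal userspace sketch that runs the sample values from the changelog through both conversions. It assumes the existing helper is a plain right shift by 24, as the quoted arithmetic implies. The names to_16bit_trunc(), to_16bit_round(), div_round_up() and util_percent() are hypothetical stand-ins for illustration, not the kernel's own functions.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating conversion (assumed current behaviour): drop the low 24 bits. */
static uint16_t to_16bit_trunc(uint64_t data_ns)
{
	return (uint16_t)(data_ns >> 24);
}

/* Round-to-nearest variant: add half of 2^24 before shifting. */
static uint16_t to_16bit_round(uint64_t data_ns)
{
	return (uint16_t)((data_ns + (1ULL << 23)) >> 24);
}

/* Userspace stand-in for DIV_ROUND_UP() on unsigned values. */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/* Percent utilization over one sample period; inputs are in 2^24 ns units. */
static unsigned int util_percent(uint16_t new_stat, uint16_t old_stat,
				 uint16_t period)
{
	uint16_t delta = (uint16_t)(new_stat - old_stat);

	return div_round_up(100u * delta, period);
}
```

With sample_period 400000134, new_stat 83648414036 and old_stat 83247417494, truncation gives 23, 4985 and 4961, so util_percent(4985, 4961, 23) is 105. Rounding to nearest gives 24, 4986 and 4962, so util_percent(4986, 4962, 24) is 100. Rounding shrinks the error but each value can still be off by half a unit, so a result slightly above 100 remains possible for other inputs.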