Message-ID: <aTryiIrwL4WJZKPt@localhost.localdomain>
Date: Thu, 11 Dec 2025 17:34:16 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: Xin Zhao <jackzxcui1989@....com>
Cc: anna-maria@...utronix.de, mingo@...nel.org, tglx@...utronix.de,
kuba@...nel.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug
Hi Xin,
On Wed, Dec 10, 2025 at 04:31:35PM +0800, Xin Zhao wrote:
> The idle and iowait statistics in /proc/stat are obtained through
> get_idle_time() and get_iowait_time(). Assuming CONFIG_NO_HZ_COMMON is
> enabled, the idle and iowait values use the idle_sleeptime and
> iowait_sleeptime statistics from tick_cpu_sched while the CPU is online,
> but use the CPUTIME_IDLE and CPUTIME_IOWAIT items from kernel_cpustat
> while the CPU is offline. Although /proc/stat does not print per-CPU
> statistics for offline CPUs, it still prints aggregated statistics of
> all possible CPUs.
>
> tick_cpu_sched and kernel_cpustat are maintained by different logic,
> leading to a significant gap. The first line of the data below shows the
> /proc/stat output when only one CPU remains after CPU offline, the second
> line shows the /proc/stat output after all CPUs are brought back online:
>
> cpu 2408558 2 916619 4275883 5403 123758 64685 0 0 0
> cpu 2408588 2 916693 4200737 4184 123762 64686 0 0 0
>
> Obviously, the other values do not fluctuate significantly, while the
> idle/iowait statistics show a substantial decrease, which makes CPU
> monitoring of the system troublesome.
>
> get_cpu_idle_time_us() calculates the latest CPU idle time from
> idle_entrytime and the current time. If the CPU is idle when it goes
> offline, the value returned by get_cpu_idle_time_us() keeps increasing,
> which is unexpected. get_cpu_iowait_time_us() has similar calculation
> logic: if the CPU is in the iowait state when it goes offline, the
> value returned by get_cpu_iowait_time_us() keeps increasing.
>
> Introduce get_cpu_idle_time_us_offline() as the _offline variant of
> get_cpu_idle_time_us(). get_cpu_idle_time_us_offline() simply returns
> the value of idle_sleeptime without any further calculation. This way,
> the /proc/stat logic can use it to get a correct CPU idle time, which
> remains unchanged while the CPU is offline. The aggregated statistics
> of all possible CPUs printed by /proc/stat then no longer fluctuate
> significantly across CPU hotplug.
> The same applies to the newly added get_cpu_iowait_time_us_offline().
>
> Signed-off-by: Xin Zhao <jackzxcui1989@....com>
So the problem is a bit deeper (warning: lots of bullets).
First of all, you shouldn't need get_cpu_iowait_time_us_offline().
get_cpu_idle_time_us() is supposed to stop advancing after
tick_sched_timer_dying().
But since this code:
	if (cpu_is_offline(cpu)) {
		cpuhp_report_idle_dead();
		arch_cpu_idle_dead();
	}
is placed after tick_nohz_idle_enter() in do_idle(), the idle time
moves forward as long as the CPU is offline.
In fact offline handling in idle should happen at the very beginning
of do_idle() because:
* There is no tick handling to do
* There is no nohz idle balancing to do
* Polling on TIF_NEED_RESCHED is useless
* There is no need to check need_resched() before offline handling,
  since stop_machine is done and all per-cpu kthreads should be done
  with their job.
So:
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe1dd17..35d79af3286d 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -260,6 +260,12 @@ static void do_idle(void)
 {
 	int cpu = smp_processor_id();
 
+	if (cpu_is_offline(cpu)) {
+		local_irq_disable();
+		cpuhp_report_idle_dead();
+		arch_cpu_idle_dead();
+	}
+
 	/*
 	 * Check if we need to update blocked load
 	 */
@@ -311,11 +317,6 @@ static void do_idle(void)
 	 */
 	local_irq_disable();
 
-	if (cpu_is_offline(cpu)) {
-		cpuhp_report_idle_dead();
-		arch_cpu_idle_dead();
-	}
-
 	arch_cpu_idle_enter();
 	rcu_nocb_flush_deferred_wakeup();
But tick_sched_timer_dying() temporarily clears the idle/iowait time.
It's fixable, but that's not all. We still have two idle time accounting
schemes, each with its own shortcomings:
* The accounting for online CPUs which is based on delta between
tick_nohz_start_idle() and tick_nohz_stop_idle().
Pros:
- Works when the tick is off
- Has nsecs granularity
Cons:
- Ignores steal time, which is then possibly spuriously subtracted from
  later system accounting.
- Assumes CONFIG_IRQ_TIME_ACCOUNTING by not accounting IRQs, but the
  idle irqtime is possibly spuriously subtracted from later system
  accounting.
- Is not accurate when CONFIG_IRQ_TIME_ACCOUNTING=n
- The windows 1) between idle task scheduling and the first call to
  tick_nohz_start_idle(), and 2) between the last tick_nohz_stop_idle()
  and the rest of the idle time, are blind spots wrt. cputime
  accounting.
* The accounting for offline CPUs which is based on ticks and the
jiffies delta during which the tick was stopped.
Pros:
- Handles steal time correctly
- Handles both CONFIG_IRQ_TIME_ACCOUNTING=y and
  CONFIG_IRQ_TIME_ACCOUNTING=n correctly.
- Handles the whole idle task
Cons:
- Doesn't elapse when the tick is off
- Has TICK_NSEC granularity (jiffies)
- Needs to track the idle ticks that were accounted and subtract
  them from the total jiffies time spent while the tick was stopped.
  This is ugly.
So what I think we should do is have a single solution, always applying
to cpustat, which takes a hybrid approach with better support:
* Tick based accounting whenever the tick isn't stopped.
* When the tick is stopped, account like we do between tick_nohz_start_idle()
and tick_nohz_stop_idle() (but only when the tick is actually stopped) and
handle CONFIG_IRQ_TIME_ACCOUNTING correctly such that:
- If y, stop on IRQ entry and restart on IRQ exit like we used to. This means
  we must flush cpu_irqtime::tick_delta upon tick restart so that it's not
  subtracted later from system accounting.
- If n, keep accounting on IRQs (remove calls to tick_nohz_start_idle()
and tick_nohz_stop_idle() during IRQs)
During that tick stopped time, tick based accounting must be ignored.
The idle steal time delta must be recorded at tick stop/tick restart
boundaries and subtracted from later idle time accounting.
This logic should be moved to vtime and we still need the offlining fix above.
Just give me a few days to work on that patchset.
Thanks.
--
Frederic Weisbecker
SUSE Labs