Message-ID: <aTryiIrwL4WJZKPt@localhost.localdomain>
Date: Thu, 11 Dec 2025 17:34:16 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: Xin Zhao <jackzxcui1989@....com>
Cc: anna-maria@...utronix.de, mingo@...nel.org, tglx@...utronix.de,
kuba@...nel.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v2 2/2] timers/nohz: Avoid /proc/stat idle/iowait fluctuation when cpu hotplug
Hi Xin,
On Wed, Dec 10, 2025 at 04:31:35PM +0800, Xin Zhao wrote:
> The idle and iowait statistics in /proc/stat are obtained through
> get_idle_time() and get_iowait_time(). Assuming CONFIG_NO_HZ_COMMON is
> enabled, the idle and iowait values use the idle_sleeptime and
> iowait_sleeptime statistics from tick_cpu_sched while the CPU is online,
> but use the CPUTIME_IDLE and CPUTIME_IOWAIT items from kernel_cpustat
> while the CPU is offline. Although /proc/stat does not print per-CPU
> statistics for offline CPUs, it still prints aggregated statistics of
> all possible CPUs.
>
> tick_cpu_sched and kernel_cpustat are maintained by different logic,
> leading to a significant gap. The first line of the data below shows the
> /proc/stat output when only one CPU remains after CPU offline, the second
> line shows the /proc/stat output after all CPUs are brought back online:
>
> cpu 2408558 2 916619 4275883 5403 123758 64685 0 0 0
> cpu 2408588 2 916693 4200737 4184 123762 64686 0 0 0
>
> Obviously, the other values do not fluctuate significantly, while the
> idle/iowait statistics show a substantial decrease, which makes CPU
> monitoring of the system troublesome.
>
> get_cpu_idle_time_us() calculates the latest CPU idle time from
> idle_entrytime and the current time. If the CPU is idle when it goes
> offline, the value returned by get_cpu_idle_time_us() keeps increasing,
> which is unexpected. get_cpu_iowait_time_us() has similar calculation
> logic: if the CPU is in the iowait state when it goes offline, the
> value returned by get_cpu_iowait_time_us() keeps increasing.
>
> Introduce get_cpu_idle_time_us_offline() as the _offline variant of
> get_cpu_idle_time_us(). get_cpu_idle_time_us_offline() simply returns
> the value of idle_sleeptime without any further calculation. This way,
> the /proc/stat logic can use it to get a correct CPU idle time, which
> remains unchanged while the CPU is offline. The aggregated statistics
> of all possible CPUs printed by /proc/stat then no longer fluctuate
> significantly across CPU hotplug.
> The same applies to the newly added get_cpu_iowait_time_us_offline().
>
> Signed-off-by: Xin Zhao <jackzxcui1989@....com>
So the problem is a bit deeper (warning: lots of bullets).
First of all, you shouldn't need get_cpu_iowait_time_us_offline().
get_cpu_idle_time_us() is supposed to stop advancing after
tick_sched_timer_dying().
But since this code:
	if (cpu_is_offline(cpu)) {
		cpuhp_report_idle_dead();
		arch_cpu_idle_dead();
	}
is placed after tick_nohz_idle_enter() in do_idle(), the idle time
moves forward as long as the CPU is offline.
In fact offline handling in idle should happen at the very beginning
of do_idle() because:
* There is no tick handling to do
* There is no nohz idle balancing to do
* Polling on TIF_NEED_RESCHED is useless
* There is no need to check need_resched() before offline handling,
  since stop_machine is done and all per-cpu kthreads should be done
  with their job.
So:
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe1dd17..35d79af3286d 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -260,6 +260,12 @@ static void do_idle(void)
 {
 	int cpu = smp_processor_id();
 
+	if (cpu_is_offline(cpu)) {
+		local_irq_disable();
+		cpuhp_report_idle_dead();
+		arch_cpu_idle_dead();
+	}
+
 	/*
 	 * Check if we need to update blocked load
 	 */
@@ -311,11 +317,6 @@ static void do_idle(void)
 	 */
 	local_irq_disable();
 
-	if (cpu_is_offline(cpu)) {
-		cpuhp_report_idle_dead();
-		arch_cpu_idle_dead();
-	}
-
 	arch_cpu_idle_enter();
 	rcu_nocb_flush_deferred_wakeup();
But tick_sched_timer_dying() temporarily clears the idle/iowait time.
It's fixable, but that's not all. We still have two idle time accounting
schemes, each with its own shortcomings:
* The accounting for online CPUs which is based on delta between
tick_nohz_start_idle() and tick_nohz_stop_idle().
Pros:
- Works when the tick is off
- Has nsecs granularity
Cons:
- Ignores steal time, which is then possibly spuriously subtracted from
  later system accounting.
- Assumes CONFIG_IRQ_TIME_ACCOUNTING by not accounting IRQs, but the
  idle irqtime is possibly spuriously subtracted from later system
  accounting.
- Is not accurate when CONFIG_IRQ_TIME_ACCOUNTING=n
- The windows 1) between idle task scheduling and the first call to
  tick_nohz_start_idle(), and 2) between the last tick_nohz_stop_idle()
  and the rest of the idle time, are blind spots wrt. cputime
  accounting.
* The accounting for offline CPUs which is based on ticks and the
jiffies delta during which the tick was stopped.
Pros:
- Handles steal time correctly
- Handles both CONFIG_IRQ_TIME_ACCOUNTING=y and
  CONFIG_IRQ_TIME_ACCOUNTING=n correctly.
- Handles the whole idle task
Cons:
- Doesn't elapse when the tick is off
- Has TICK_NSEC granularity (jiffies)
- Needs to track the idle ticks that were accounted and subtract
  them from the total jiffies time spent while the tick was stopped.
  This is ugly.
So what I think we should do is have a single solution, always applying
to cpustat, which takes a hybrid approach with better support:
* Tick based accounting whenever the tick isn't stopped.
* When the tick is stopped, account like we do between tick_nohz_start_idle()
and tick_nohz_stop_idle() (but only when the tick is actually stopped) and
handle CONFIG_IRQ_TIME_ACCOUNTING correctly such that:
- If y, stop on IRQ entry and restart on IRQ exit like we used to. This means
  we must flush cpu_irqtime::tick_delta upon tick restart so that it's not
  subtracted later from system accounting.
- If n, keep accounting on IRQs (remove calls to tick_nohz_start_idle()
and tick_nohz_stop_idle() during IRQs)
During that tick stopped time, tick based accounting must be ignored.
The idle steal time delta must be recorded at tick stop/tick restart
boundaries and subtracted from later idle time accounting.
This logic should be moved to vtime and we still need the offlining fix above.
Just give me a few days to work on that patchset.
Thanks.
--
Frederic Weisbecker
SUSE Labs