Message-ID: <20250211194558.803373-1-sshegde@linux.ibm.com>
Date: Wed, 12 Feb 2025 01:15:58 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: frederic@...nel.org, linux-kernel@...r.kernel.org
Cc: sshegde@...ux.ibm.com, mingo@...nel.org, peterz@...radead.org,
vincent.guittot@...aro.org, maddy@...ux.ibm.com,
dietmar.eggemann@....com, riel@...riel.com
Subject: [RFC] sched/cputime: issue with time accounting using default configs
While experimenting with IRQ time accounting, I stumbled upon an issue
with cputime accounting while running simple benchmarks.

This is very likely a common issue across architectures unless one turns
on IRQ_TIME_ACCOUNTING. I took a look at the source RPMs of RHEL and SUSE;
only RHEL on x86 seems to enable it.
(default configs)
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
all    3.41    0.00   73.81    0.00   22.00    0.00    0.10    0.00    0.00    0.67
all    3.39    0.00   73.30    0.00   22.71    0.01    0.01    0.00    0.00    0.58
(With CONFIG_IRQ_TIME_ACCOUNTING=y)
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_IRQ_TIME_ACCOUNTING=y
CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
all    3.64    0.00   94.26    0.00    1.77    0.06    0.05    0.00    0.00    0.21
all    3.42    0.00   93.89    0.00    1.94    0.07    0.00    0.00    0.00    0.68
(With NATIVE forced on by removing the conditional check in NO_HZ_FULL)
CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
all    5.78    0.00   92.55    0.00    1.56    0.00    0.00    0.00    0.00    0.11
all    6.14    0.00   91.86    0.00    1.68    0.02    0.00    0.00    0.00    0.29
Judging by the code, NATIVE accounting seems to be the most accurate,
since it tracks entry to and exit from user, hardirq and softirq context,
though it comes with its own overhead.
The difference in *irq time* is drastic, which made me wonder why.
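To put numbers on that difference, here is a small sketch that averages the
%sys and %irq columns from the three mpstat outputs above. The dictionary keys
and the summarize() helper are illustrative names, not part of any tool. With
these samples the %sys + %irq total stays within roughly two percentage points
across the three configurations, while %irq alone moves by about twenty, which
suggests time is being shifted between the two buckets rather than appearing
or disappearing.

```python
# %sys and %irq values copied from the mpstat rows quoted in this mail.
samples = {
    "default (tick)": [(73.81, 22.00), (73.30, 22.71)],
    "IRQ_TIME_ACCOUNTING": [(94.26, 1.77), (93.89, 1.94)],
    "NATIVE": [(92.55, 1.56), (91.86, 1.68)],
}


def summarize(rows):
    """Return (mean %sys, mean %irq, mean %sys + mean %irq) for one config."""
    n = len(rows)
    mean_sys = sum(sys_pct for sys_pct, _ in rows) / n
    mean_irq = sum(irq_pct for _, irq_pct in rows) / n
    return mean_sys, mean_irq, mean_sys + mean_irq


summary = {name: summarize(rows) for name, rows in samples.items()}

for name, (mean_sys, mean_irq, combined) in summary.items():
    print(f"{name:20s} %sys={mean_sys:6.2f} %irq={mean_irq:5.2f} "
          f"sys+irq={combined:6.2f}")
```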
This happens because, when NO_HZ_FULL is chosen, NATIVE accounting
cannot be enabled and GENeric is the only option.
GEN -> account_process_tick ->
    -> If context tracking is enabled, do accounting based on it.
    -> If irq_time accounting is enabled, do that.
    -> If not, fall back to simple tick-based accounting. With this, the
       whole tick duration can be attributed to IRQ, which is not true.
NATIVE -> account_process_tick ->
    vtime_flush - native accounting.
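The tick-based fallback can be illustrated with a toy model. This is not the
kernel code: the timeline, the microsecond units and both function names are
assumptions made up for illustration. It compares charging each context for
exactly the time it ran against charging a whole tick to whatever context
happens to be running when the tick fires.

```python
from collections import Counter


def precise_accounting(timeline):
    """Charge each context exactly the time it ran (entry/exit tracking)."""
    totals = Counter()
    for context, duration_us in timeline:
        totals[context] += duration_us
    return totals


def tick_accounting(timeline, tick_us):
    """Charge a whole tick to the context that is running when a tick fires.

    Ticks fire at t = 0, tick_us, 2 * tick_us, ... and each segment covers
    the half-open interval [start, end).
    """
    totals = Counter()
    next_tick = 0
    start = 0
    for context, duration_us in timeline:
        end = start + duration_us
        while next_tick < end:
            totals[context] += tick_us
            next_tick += tick_us
        start = end
    return totals


TICK_US = 1000  # hypothetical tick length, in microseconds
# Worst case: a short 50 us interrupt lines up with every tick.
timeline = [("irq", 50), ("system", 950)] * 10

precise = precise_accounting(timeline)
sampled = tick_accounting(timeline, TICK_US)
print("precise:", dict(precise))
print("sampled:", dict(sampled))
```

In this deliberately extreme timeline the interrupt accounts for 5% of the
real time, yet the sampled view charges every tick to it. Real workloads sit
somewhere in between, but the direction of the error is the same.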
The main concern is that context tracking is enabled only if NO_HZ_FULL=y and
nohz_full= (or isolcpus=) is passed on the command line. Most kernels are
built with NO_HZ_FULL, but many systems may not pass nohz_full= (correct me
if I am wrong). As a result, context tracking is not enabled. Since irq time
accounting isn't enabled either, it falls back to simple tick-based accounting.
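As a rough way to check which case a given system falls into, the sketch
below combines a .config-style text with the kernel command line and follows
the order described in this mail. The function names and return strings are
hypothetical, and the model is simplified: it ignores which CPUs are listed in
nohz_full= and treats any isolcpus= value as enabling context tracking.

```python
def enabled_options(config_text):
    """Collect CONFIG_* symbols set to 'y' in a kernel .config-style text."""
    options = set()
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" starts with '#', so it is skipped here.
        if line.startswith("CONFIG_") and line.endswith("=y"):
            options.add(line[: -len("=y")])
    return options


def accounting_path(config_text, cmdline):
    """Guess the cputime accounting path (simplified, per this mail)."""
    opts = enabled_options(config_text)
    params = {arg.split("=", 1)[0] for arg in cmdline.split()}
    if "CONFIG_VIRT_CPU_ACCOUNTING_NATIVE" in opts:
        return "native"
    context_tracking = (
        "CONFIG_VIRT_CPU_ACCOUNTING_GEN" in opts
        and "CONFIG_NO_HZ_FULL" in opts
        and bool(params & {"nohz_full", "isolcpus"})
    )
    if context_tracking:
        return "context tracking"
    if "CONFIG_IRQ_TIME_ACCOUNTING" in opts:
        return "irq_time"
    return "tick"


default_config = """
CONFIG_NO_HZ_FULL=y
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
"""

print(accounting_path(default_config, "root=/dev/sda1 quiet"))
print(accounting_path(default_config, "root=/dev/sda1 nohz_full=1-7"))
```

With the default config and no nohz_full= on the command line, the sketch
lands on the plain tick-based path, which is the case this mail is about.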
There are a few ways to fix this; some may not be sane. These are the hacks
that I have tried.
1. Looking at irq_time vs. native accounting, irq_time seems lightweight
   and close enough to native, so maybe it can be a middle ground. Enable
   it in the arch default configs. That way distros can enable it. The
   patch below uses this method.
   NOTE: this still needs more work w.r.t. measuring the overhead.
2. Select IRQ_TIME_ACCOUNTING when NO_HZ_FULL is chosen. This would fix the
   accounting issue for all archs, but given the slight overhead, some archs
   may not want it.
3. If context tracking is not enabled, do accounting the native way if the
   arch supports it. Since native and irq_time are mutually exclusive, only
   one of them can be enabled. This needs a lot of change given how the
   current code is structured with macros. It would also mean decoupling
   native from NO_HZ_FULL.
Is this a problem worth fixing? Is there a better way to fix it?
Signed-off-by: Shrikanth Hegde <sshegde@...ux.ibm.com>
---
arch/powerpc/configs/ppc64_defconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig
index 465eb96c755e..9bc678d92384 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -3,6 +3,7 @@ CONFIG_POSIX_MQUEUE=y
CONFIG_AUDIT=y
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ=y
+CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
--
2.39.3