Message-Id: <20250821042707.62993-1-adamli@os.amperecomputing.com>
Date: Thu, 21 Aug 2025 04:27:05 +0000
From: Adam Li <adamli@...amperecomputing.com>
To: anna-maria@...utronix.de,
frederic@...nel.org,
tglx@...utronix.de,
mingo@...hat.com,
peterz@...radead.org,
juri.lelli@...hat.com,
vincent.guittot@...aro.org,
vschneid@...hat.com
Cc: dietmar.eggemann@....com,
rostedt@...dmis.org,
bsegall@...gle.com,
mgorman@...e.de,
cl@...ux.com,
linux-kernel@...r.kernel.org,
patches@...erecomputing.com,
Adam Li <adamli@...amperecomputing.com>
Subject: [PATCH RESEND 0/2] tick/nohz: CPU cannot enter NOHZ idle balance state
Valentin Schneider suggested resending this patch and copying the
scheduler reviewers [1].
When running llama on an arm64 server, some CPUs *stay* idle while others
are 100% busy. All CPUs are in the 'nohz_full=' CPU list, and
CONFIG_NO_HZ_FULL is set. The server has 192 CPUs and boots with the
kernel option 'nohz_full=0-191'.
The problem is caused by two issues:
1) Some idle CPUs cannot be added to 'nohz.idle_cpus_mask'. This bug
is fixed by the first patch in this series:
"tick/nohz: Fix wrong NOHZ idle CPU state".
2) Even if the idle CPUs are in 'nohz.idle_cpus_mask', no CPU can be
selected to do NOHZ idle load balancing because the conditions in
find_new_ilb() are too strict. This issue is fixed by the patch in [2].
We can see that the idle CPUs are not in nohz.idle_cpus_mask, and NOHZ
idle load balancing only considers CPUs in nohz.idle_cpus_mask. The ticks
on the idle CPUs are stopped, so periodic load balancing is not triggered
either. As a result the idle CPUs are never used and the imbalance
persists.
A CPU is added to nohz.idle_cpus_mask in:
  do_idle()
    -> tick_nohz_idle_stop_tick()
      -> nohz_balance_enter_idle()
The call to nohz_balance_enter_idle() depends on the '!was_stopped'
condition. It looks like 'was_stopped' is used to avoid calling
nohz_balance_enter_idle() twice and setting 'ts->idle_jiffies' twice.
When the CPU is in nohz_full mode, 'was_stopped' may always be true.
The call path might be:
  tick_nohz_full_stop_tick() /* stop tick and set TS_FLAG_STOPPED */
  ... ...
  do_idle()
    -> tick_nohz_idle_stop_tick() /* was_stopped == 1 */
The first patch, "Fix wrong NOHZ idle CPU state", makes the call to
nohz_balance_enter_idle() independent of '!was_stopped'. This is safe
because nohz_balance_enter_idle() already checks
'rq->nohz_tick_stopped' to avoid setting nohz.idle_cpus_mask twice.
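
The reasoning above can be sketched as a small user-space model. This is
not kernel code and not the patch itself: the struct, the toy_* function
names and the 'patched' switch are made up for illustration, and only the
flag names mirror the ones mentioned in this letter.

```c
#include <stdbool.h>

/* Toy per-CPU state; a simplification, not the kernel's data structures. */
struct toy_cpu {
	bool tick_stopped;       /* stands in for TS_FLAG_STOPPED */
	bool nohz_tick_stopped;  /* stands in for rq->nohz_tick_stopped */
	bool in_idle_cpus_mask;  /* membership in nohz.idle_cpus_mask */
	int mask_updates;        /* how often the mask bit was set */
};

/* Models nohz_balance_enter_idle(): guarded against repeated mask updates. */
static void toy_balance_enter_idle(struct toy_cpu *cpu)
{
	if (cpu->nohz_tick_stopped)
		return;
	cpu->nohz_tick_stopped = true;
	cpu->in_idle_cpus_mask = true;
	cpu->mask_updates++;
}

/* Models tick_nohz_full_stop_tick(): stops the tick outside the idle path. */
static void toy_full_stop_tick(struct toy_cpu *cpu)
{
	cpu->tick_stopped = true;
}

/*
 * Models tick_nohz_idle_stop_tick(). With patched == false the mask update
 * depends on !was_stopped; with patched == true it no longer does.
 */
static void toy_idle_stop_tick(struct toy_cpu *cpu, bool patched)
{
	bool was_stopped = cpu->tick_stopped;

	cpu->tick_stopped = true;
	if (patched || !was_stopped)
		toy_balance_enter_idle(cpu);
}
```

In this model, calling toy_full_stop_tick() and then
toy_idle_stop_tick(cpu, false) leaves the CPU out of the mask, while the
patched variant adds it exactly once no matter how often the idle path
runs.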
The second patch, "Trigger warning when CPU in wrong NOHZ idle state",
is for debugging only. It is not intended to be merged. The patch can
help to reproduce the bug.
The warning is triggered when a CPU is in this 'wrong' state:
1) the tick was already stopped before tick_nohz_idle_stop_tick()
stops the tick
2) and the CPU is not in nohz.idle_cpus_mask
3) and the CPU is idle
4) and the tick is stopped
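
The four conditions above combine into a single predicate. The sketch
below is only an illustration of that logic in plain C; the struct and
function names are hypothetical and it is not the code from the debug
patch.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the four conditions refer to. */
struct toy_nohz_state {
	bool was_stopped;        /* 1) tick already stopped on entry */
	bool in_idle_cpus_mask;  /* 2) CPU present in nohz.idle_cpus_mask */
	bool cpu_is_idle;        /* 3) CPU is idle */
	bool tick_stopped;       /* 4) tick is stopped now */
};

/* Returns true only when all four conditions of the 'wrong' state hold. */
static bool toy_wrong_nohz_idle_state(const struct toy_nohz_state *s)
{
	return s->was_stopped &&
	       !s->in_idle_cpus_mask &&
	       s->cpu_is_idle &&
	       s->tick_stopped;
}
```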
When the kernel boots on my system, this warning is triggered:
[ 15.536604] WARNING: CPU: 1 PID: 0 at kernel/time/tick-sched.c:1230 tick_nohz_idle_stop_tick+0x148/0x160
[ 15.550687] Modules linked in:
[ 15.553731] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.17.0-rc1-cls-00002-g39cde4c0206e-dirty #109 VOLUNTARY
[ 15.580390] pstate: 614000c9 (nZCv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
<snip>
[ 15.703028] Call trace:
[ 15.705462] tick_nohz_idle_stop_tick+0x148/0x160 (P)
[ 15.710502] cpuidle_idle_call+0x118/0x1d0
[ 15.714588] do_idle+0xf4/0x100
[ 15.717717] cpu_startup_entry+0x40/0x50
[ 15.721627] secondary_start_kernel+0xe4/0x128
[ 15.732745] __secondary_switched+0xc0/0xc8
After the first patch, the CPU is added to nohz.idle_cpus_mask, and
NOHZ idle balancing can move tasks to this CPU.
Adam Li (2):
tick/nohz: Fix wrong NOHZ idle CPU state
tick/nohz: Trigger warning when CPU in wrong NOHZ idle state
Links
[1]: https://lore.kernel.org/all/xhsmho6sagz7p.mognet@vschneid-thinkpadt14sgen2i.remote.csb/
[2]: https://lore.kernel.org/all/20250819025720.14794-1-adamli@os.amperecomputing.com/
include/linux/sched/nohz.h | 2 ++
kernel/sched/fair.c | 5 +++++
kernel/time/tick-sched.c | 8 +++++---
3 files changed, 12 insertions(+), 3 deletions(-)
--
2.34.1