[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20251023121213.38282-1-jiahao.kernel@gmail.com>
Date: Thu, 23 Oct 2025 20:12:13 +0800
From: Hao Jia <jiahao.kernel@...il.com>
To: mingo@...hat.com,
peterz@...radead.org,
mingo@...nel.org,
juri.lelli@...hat.com,
vincent.guittot@...aro.org,
dietmar.eggemann@....com,
rostedt@...dmis.org,
bsegall@...gle.com,
mgorman@...e.de,
vschneid@...hat.com,
kprateek.nayak@....com
Cc: linux-kernel@...r.kernel.org,
Hao Jia <jiahao1@...iang.com>,
Aaron Lu <ziqianlu@...edance.com>
Subject: [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()
From: Hao Jia <jiahao1@...iang.com>
Aaron Lu and I hit a non-empty throttled_limbo_list warning in
tg_throttle_down() during testing.
WARNING: kernel/sched/fair.c:5975 at tg_throttle_down+0x2bd/0x2f0, CPU#20: swapper/20/0
Call Trace:
? __pfx_tg_nop+0x10/0x10
walk_tg_tree_from+0x39/0xd0
? __pfx_tg_throttle_down+0x10/0x10
throttle_cfs_rq+0x176/0x1f0
enqueue_task_fair+0x4f5/0xd30
? unthrottle_cfs_rq+0x2f7/0x3a0
tg_unthrottle_up+0x10c/0x2f0
? __pfx_tg_unthrottle_up+0x10/0x10
walk_tg_tree_from+0x66/0xd0
? __pfx_tg_nop+0x10/0x10
unthrottle_cfs_rq+0x16b/0x3a0
__cfsb_csd_unthrottle+0x1f0/0x250
? __pfx___cfsb_csd_unthrottle+0x10/0x10
__flush_smp_call_function_queue+0x104/0x440
? tick_nohz_account_idle_time+0x4c/0x80
flush_smp_call_function_queue+0x3b/0x80
do_idle+0x14f/0x240
cpu_startup_entry+0x30/0x40
start_secondary+0x128/0x160
common_startup_64+0x13e/0x141
cgroup hierarchy:
root
/ \
A* ...
/ | \ ...
B* ...
Debugging shows the following:
A and B are configured with relatively small quota and large period.
At some point, cfs_rq_A is throttled. Due to the throttling of cfs_rq_A,
the tasks on cfs_rq_B are added to cfs_rq_B's throttled_limbo_list.
Resetting task_group B quota will set cfs_rq_B runtime_remaining to 0 in
tg_set_cfs_bandwidth().
Since task_group A is throttled, Therefore, task on cfs_rq_B cannot run,
and runtime_remaining stays 0. With task_group B has a small quota,
tasks on other CPUs in task_group B quickly consume all of
cfs_b_B->runtime, causing cfs_b_B->runtime to be 0.
When cfs_rq_A is unthrottled later, tg_unthrottle_up(cfs_rq_B) will
re-queues task. However, because cfs_rq_B->runtime_remaining still 0,
and it cannot obtain runtime from cfs_b_B->runtime either. Therefore,
the task will be throttled in
enqueue_task_fair()->enqueue_entity()->check_enqueue_throttle(),
triggering a non-empty throttled_limbo_list warning in tg_throttle_down().
Root Cause:
In unthrottle_cfs_rq(), we only checked cfs_rq_A->runtime_remaining, but
enqueue_task_fair() requires that the runtime_remaining of each cfs_rq
level be greater than 0.
Solution:
One way to fix this warning is to add a runtime_remaining check for
each cfs_rq level of the task in unthrottle_cfs_rq(), but this makes code
strange and complicated.
Another straightforward and simple solution is to add a new enqueue flag
to ensure that enqueue in tg_unthrottle_up() will not immediately trigger
throttling. This may enqueue sched_entity with no remaining runtime, which
is not a big deal because the current kernel also has such situations [1].
We still retain the runtime_remaining check in unthrottle_cfs_rq() for
higher-level cfs_rq to avoid enqueuing many entities with
runtime_remaining < 0.
Also remove the redundant assignment to se in tg_throttle_down().
[1]: https://lore.kernel.org/all/20251015084045.GB35@bytedance
Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reported-by: Aaron Lu <ziqianlu@...edance.com>
Closes: https://lore.kernel.org/all/20251016065438.GA32@bytedance
Signed-off-by: Hao Jia <jiahao1@...iang.com>
---
Reproduction steps:
(1) Create a test.sh script and run it:
#!/bin/bash
set -e
CGP_ROOT="/sys/fs/cgroup/"
if ! mount | grep -q "type cgroup2"; then
echo "ERROR: Not in cgroup v2 mode."
exit 1
fi
echo 1 > /sys/kernel/debug/clear_warn_once
mkdir -p ${CGP_ROOT}/test
echo "+cpu +cpuset" > ${CGP_ROOT}/test/cgroup.subtree_control
for i in $(seq 1 2); do
SUB_DIR=${CGP_ROOT}/test/sub$i
mkdir -p ${SUB_DIR}
echo $$ > ${SUB_DIR}/cgroup.procs
perf bench sched messaging -g 10 -t -l 50000 &
echo $$ > /sys/fs/cgroup/cgroup.procs
echo "1000 100000" > ${SUB_DIR}/cpu.max
echo "Started stress in stress_sub$i"
done
echo "1000 100000" > ${CGP_ROOT}/test/cpu.max
(2) Run the following command multiple times until the warning is triggered:
echo 1000 10000 > /sys/fs/cgroup/test/sub1/cpu.max
kernel/sched/fair.c | 15 ++++++++++-----
kernel/sched/sched.h | 3 +++
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58e696d7ccc..7721466dc8f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5287,7 +5287,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
se->on_rq = 1;
if (cfs_rq->nr_queued == 1) {
- check_enqueue_throttle(cfs_rq);
+ if (!(flags & ENQUEUE_THROTTLE))
+ check_enqueue_throttle(cfs_rq);
+
list_add_leaf_cfs_rq(cfs_rq);
#ifdef CONFIG_CFS_BANDWIDTH
if (cfs_rq->pelt_clock_throttled) {
@@ -5912,7 +5914,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
list_del_init(&p->throttle_node);
p->throttled = false;
- enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+ enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
}
/* Add cfs_rq with load or one or more already running entities to the list */
@@ -6029,15 +6031,18 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
* unthrottled us with a positive runtime_remaining but other still
* running entities consumed those runtime before we reached here.
*
- * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+ * We can't unthrottle this cfs_rq without any runtime remaining
* because any enqueue in tg_unthrottle_up() will immediately trigger a
* throttle, which is not supposed to happen on unthrottle path.
+ *
+ * Although the ENQUEUE_THROTTLE flag ensures that enqueues in
+ * tg_unthrottle_up() do not trigger a throttle, we still retain the check
+ * for cfs_rq->runtime_remaining. This prevents the enqueueing of many
+ * entities whose runtime_remaining is less than 0.
*/
if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
return;
- se = cfs_rq->tg->se[cpu_of(rq)];
-
cfs_rq->throttled = 0;
update_rq_clock(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e7718f12bc55..6f45e00d1fc2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2368,6 +2368,8 @@ extern const u32 sched_prio_to_wmult[40];
* ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
* ENQUEUE_MIGRATED - the task was migrated during wakeup
* ENQUEUE_RQ_SELECTED - ->select_task_rq() was called
+ * ENQUEUE_THROTTLE - Called in tg_unthrottle_up() to ensure that
+ * task can be enqueued during unthrottle
*
* XXX SAVE/RESTORE in combination with CLASS doesn't really make sense, but
* SCHED_DEADLINE seems to rely on this for now.
@@ -2399,6 +2401,7 @@ extern const u32 sched_prio_to_wmult[40];
#define ENQUEUE_MIGRATED 0x00040000
#define ENQUEUE_INITIAL 0x00080000
#define ENQUEUE_RQ_SELECTED 0x00100000
+#define ENQUEUE_THROTTLE 0x00200000
#define RETRY_TASK ((void *)-1UL)
--
2.34.1
Powered by blists - more mailing lists