Message-ID: <20250926081918.30488-1-kprateek.nayak@amd.com>
Date: Fri, 26 Sep 2025 08:19:17 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Matteo Martelli
<matteo.martelli@...ethink.co.uk>, Aaron Lu <ziqianlu@...edance.com>,
<linux-kernel@...r.kernel.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, K Prateek Nayak
<kprateek.nayak@....com>
Subject: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled

Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
enqueue_task_fair() after commit fe8d238e646e ("sched/fair: Propagate
load for throttled cfs_rq"), which switched the leaf cfs_rq insertions
in propagate_entity_cfs_rq() to the cfs_rq_pelt_clock_throttled()
check.
The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
its descendants should have their PELT frozen too or weird things can
happen as a result of children accumulating PELT signals when the
parents have their PELT clock stopped.
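
For readers cross-referencing fair.c, the helper named above is
essentially a thin wrapper around this per-cfs_rq flag; a rough sketch
(the exact definition in the tree may differ slightly):

	static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
	{
	#ifdef CONFIG_CFS_BANDWIDTH
		/* Set when throttling froze this cfs_rq's PELT clock. */
		return !!cfs_rq->pelt_clock_throttled;
	#else
		/* Without CFS bandwidth there is nothing to throttle. */
		return false;
	#endif
	}

A cfs_rq for which this returns true must not be added to the leaf
cfs_rq list until its PELT clock is unfrozen again.
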
Another side effect of this is the loss of integrity of the leaf cfs_rq
list. As debugged by Aaron, consider the following hierarchy:

        root(#)
        /     \
     A(#)     B(*)
               |
               C  <--- new cgroup
               |
               D  <--- new cgroup

  # - Already on leaf cfs_rq list
  * - Throttled with PELT frozen

The newly created cgroups don't have their "pelt_clock_throttled" signal
synced with cgroup B. Next, the following series of events occurs:

1. online_fair_sched_group() for cgroup D calls
   propagate_entity_cfs_rq(). (The same can happen if a throttled task
   is moved to cgroup C and enqueue_task_fair() returns early.)

   propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
   "rq->tmp_alone_branch" since its PELT clock is not marked throttled
   and the cfs_rq of cgroup B is not on the list.

   The cfs_rq of cgroup B is skipped since its PELT is throttled.

   The root cfs_rq is already on the leaf cfs_rq list, leading to
   list_add_leaf_cfs_rq() returning early.

   The cfs_rq of cgroup C is left dangling on "rq->tmp_alone_branch".

2. A new task wakes up on cgroup A. Since the whole hierarchy is
   already on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps
   returning early without any modifications to
   "rq->tmp_alone_branch".

   The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
   dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch"
   (see the sketch of the assert below).

   !!! Splat !!!
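
For context, "rq->tmp_alone_branch" marks the point where a partially
built branch of cfs_rqs still has to be connected back to
"rq->leaf_cfs_rq_list"; the assert only checks that no such partial
branch is left open. A rough sketch of the check (the helper in fair.c
may differ slightly):

	static inline void assert_list_leaf_cfs_rq(struct rq *rq)
	{
		/*
		 * Every list_add_leaf_cfs_rq() sequence must eventually close
		 * the branch, resetting tmp_alone_branch to the list head. A
		 * dangling cfs_rq (like cgroup C above) leaves it pointing
		 * elsewhere and trips the warning.
		 */
		WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
	}
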
Syncing the "pelt_clock_throttled" indicator with the parent cfs_rq is
not enough since the new cfs_rq is not yet enqueued on the hierarchy. A
dequeue on another subtree of the throttled hierarchy can freeze the
PELT clock for the parent hierarchy without setting the indicator for
this newly added cfs_rq which was never enqueued.

Since there are no tasks on the new hierarchy, start a cfs_rq on a
throttled hierarchy with its PELT clock throttled. The first enqueue,
or the distribution (whichever happens first), will unfreeze the PELT
clock and queue the cfs_rq on the leaf cfs_rq list.

While at it, add an assert_list_leaf_cfs_rq() in
propagate_entity_cfs_rq() to catch such cases in the future.

Suggested-by: Aaron Lu <ziqianlu@...edance.com>
Reported-by: Matteo Martelli <matteo.martelli@...ethink.co.uk>
Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
Stress testing included running sched-messaging in a nested hierarchy
with various quotas set, alongside a continuous loop of cgroup creation
and deletion, as well as another loop continuously moving a busy-looping
task between cgroups.

No splats have been observed with this patch so far.

Aaron, Matteo,
I've not added any "Tested-by" tags since the final diff is slightly
different from the diff shared previously. The previous diff led to this
splat during my test with the newly added assert:
------------[ cut here ]------------
WARNING: CPU: 6 PID: 22912 at kernel/sched/fair.c:400 propagate_entity_cfs_rq+0x1f9/0x270
CPU: 6 UID: 0 PID: 22912 Comm: bash Not tainted 6.17.0-rc4-test+ #50 PREEMPT_{RT,(full)}
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:propagate_entity_cfs_rq+0x1f9/0x270
Code: ...
RSP: 0018:ffffd4a203713928 EFLAGS: 00010087
RAX: ffff8ea4714b1e00 RBX: 0000000000000000 RCX: 0000000000000001
RDX: ffff8ea4714b28e8 RSI: 0000003570150cf8 RDI: ffff8e65abc69400
RBP: ffff8ea4714b1f00 R08: 0000000000006160 R09: 0000000000016d5a
R10: 000000000000009c R11: 0000000000000000 R12: ffff8ea4714b1e00
R13: ffff8e6586df1a00 R14: 0000000000000001 R15: 0000000000000246
FS: 00007f73a6a7f740(0000) GS:ffff8e64d0689000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000562b9791d461 CR3: 000000010d875003 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
<TASK>
sched_move_task+0xc2/0x280
cpu_cgroup_attach+0x37/0x70
cgroup_migrate_execute+0x3b8/0x560
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
cgroup_attach_task+0x15b/0x200
? cgroup_attach_permissions+0x17e/0x1e0
__cgroup_procs_write+0x105/0x170
cgroup_procs_write+0x17/0x30
kernfs_fop_write_iter+0x127/0x1c0
vfs_write+0x305/0x420
ksys_write+0x6b/0xe0
do_syscall_64+0x85/0xb30
? __x64_sys_close+0x3d/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
? srso_alias_return_thunk+0x5/0xfbef5
? wp_page_reuse.constprop.0+0x7c/0x90
? srso_alias_return_thunk+0x5/0xfbef5
? do_wp_page+0x3df/0xd70
? set_close_on_exec+0x31/0x70
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
? srso_alias_return_thunk+0x5/0xfbef5
? do_fcntl+0x3f6/0x760
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? __handle_mm_fault+0x375/0x380
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? count_memcg_events+0xd9/0x1c0
? srso_alias_return_thunk+0x5/0xfbef5
? handle_mm_fault+0x1af/0x290
? srso_alias_return_thunk+0x5/0xfbef5
? do_user_addr_fault+0x2d0/0x8c0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f73a6b968f7
Code: ...
RSP: 002b:00007ffebcfdc688 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f73a6b968f7
RDX: 0000000000000006 RSI: 0000562b9791d460 RDI: 0000000000000001
RBP: 0000562b9791d460 R08: 00007f73a6c53460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
R13: 00007f73a6c9d780 R14: 00007f73a6c99600 R15: 00007f73a6c98a00
</TASK>
---[ end trace 0000000000000000 ]---

... which happens because we can have:

           A (*throttled; PELT is not frozen)
          / \
   (new) C   B (contains runnable tasks)

C is a new cgroup that is not enqueued on the cfs_rq of A yet. B is
another cfs_rq with tasks and is enqueued on the cfs_rq of A. The
entire hierarchy shown above is throttled.

When all tasks on B leave, the PELT of A and B is frozen, but since C
was never queued on A and A's PELT wasn't throttled when C was created,
C still has "pelt_clock_throttled" set to 0.

Next, propagate_entity_cfs_rq() on C sees that C does not have its PELT
clock throttled and adds it to "rq->tmp_alone_branch", but then sees
that A has its PELT clock throttled, which leads to the dangling
reference or, worse, incorrect ordering in the leaf cfs_rq list.

Changes are prepared on top of tip:sched/core at commit 45b7f780739a
("sched: Fix some typos in include/linux/preempt.h")
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18a30ae35441..9af82e214312 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,6 +6437,16 @@ static void sync_throttle(struct task_group *tg, int cpu)
 
 	cfs_rq->throttle_count = pcfs_rq->throttle_count;
 	cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
+
+	/*
+	 * It is not enough to sync the "pelt_clock_throttled" indicator
+	 * with the parent cfs_rq when the hierarchy is not queued.
+	 * Always join a throttled hierarchy with PELT clock throttled
+	 * and leave it to the first enqueue, or distribution to
+	 * unthrottle the PELT clock and add us on the leaf_cfs_rq_list.
+	 */
+	if (cfs_rq->throttle_count)
+		cfs_rq->pelt_clock_throttled = 1;
 }
 
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
@@ -13192,6 +13202,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
 			list_add_leaf_cfs_rq(cfs_rq);
 	}
+
+	assert_list_leaf_cfs_rq(rq_of(cfs_rq));
 }
 #else /* !CONFIG_FAIR_GROUP_SCHED: */
 static void propagate_entity_cfs_rq(struct sched_entity *se) { }

base-commit: 45b7f780739a3145aeef24d2dfa02517a6c82ed6
--
2.34.1