Message-ID: <20250926081918.30488-1-kprateek.nayak@amd.com>
Date: Fri, 26 Sep 2025 08:19:17 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Matteo Martelli
<matteo.martelli@...ethink.co.uk>, Aaron Lu <ziqianlu@...edance.com>,
<linux-kernel@...r.kernel.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, K Prateek Nayak
<kprateek.nayak@....com>
Subject: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled

Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
enqueue_task_fair() after commit fe8d238e646e ("sched/fair: Propagate
load for throttled cfs_rq"), which switched the leaf cfs_rq insertions
in propagate_entity_cfs_rq() to the cfs_rq_pelt_clock_throttled()
check.
The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
its descendants should have their PELT frozen too or weird things can
happen as a result of children accumulating PELT signals when the
parents have their PELT clock stopped.
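
For readers cross-referencing fair.c, the helper named above is
essentially a thin wrapper around this per-cfs_rq flag; a rough sketch
(the exact definition in the tree may differ slightly):

	static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
	{
	#ifdef CONFIG_CFS_BANDWIDTH
		/* Set when throttling froze this cfs_rq's PELT clock. */
		return !!cfs_rq->pelt_clock_throttled;
	#else
		/* Without CFS bandwidth there is nothing to throttle. */
		return false;
	#endif
	}

A cfs_rq for which this returns true must not be added to the leaf
cfs_rq list until its PELT clock is unfrozen again.
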
Another side effect of this is the loss of integrity of the leaf cfs_rq
list. As debugged by Aaron, consider the following hierarchy:

        root(#)
        /     \
     A(#)     B(*)
               |
               C  <--- new cgroup
               |
               D  <--- new cgroup

  # - Already on leaf cfs_rq list
  * - Throttled with PELT frozen

The newly created cgroups don't have their "pelt_clock_throttled" signal
synced with cgroup B. Next, the following series of events occurs:

1. online_fair_sched_group() for cgroup D calls
   propagate_entity_cfs_rq(). (The same can happen if a throttled task
   is moved to cgroup C and enqueue_task_fair() returns early.)

   propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
   "rq->tmp_alone_branch" since its PELT clock is not marked throttled
   and the cfs_rq of cgroup B is not on the list.

   The cfs_rq of cgroup B is skipped since its PELT is throttled.

   The root cfs_rq is already on the leaf cfs_rq list, leading to
   list_add_leaf_cfs_rq() returning early.

   The cfs_rq of cgroup C is left dangling on "rq->tmp_alone_branch".

2. A new task wakes up on cgroup A. Since the whole hierarchy is
   already on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps
   returning early without any modifications to
   "rq->tmp_alone_branch".

   The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
   dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch"
   (see the sketch of the assert below).

   !!! Splat !!!
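
For context, "rq->tmp_alone_branch" marks the point where a partially
built branch of cfs_rqs still has to be connected back to
"rq->leaf_cfs_rq_list"; the assert only checks that no such partial
branch is left open. A rough sketch of the check (the helper in fair.c
may differ slightly):

	static inline void assert_list_leaf_cfs_rq(struct rq *rq)
	{
		/*
		 * Every list_add_leaf_cfs_rq() sequence must eventually close
		 * the branch, resetting tmp_alone_branch to the list head. A
		 * dangling cfs_rq (like cgroup C above) leaves it pointing
		 * elsewhere and trips the warning.
		 */
		WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
	}
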
Syncing the "pelt_clock_throttled" indicator with the parent cfs_rq is
not enough since the new cfs_rq is not yet enqueued on the hierarchy. A
dequeue on another subtree of the throttled hierarchy can freeze the
PELT clock for the parent hierarchy without setting the indicator for
this newly added cfs_rq which was never enqueued.

Since there are no tasks on the new hierarchy, start a cfs_rq on a
throttled hierarchy with its PELT clock throttled. The first enqueue,
or the distribution (whichever happens first), will unfreeze the PELT
clock and queue the cfs_rq on the leaf cfs_rq list.

While at it, add an assert_list_leaf_cfs_rq() in
propagate_entity_cfs_rq() to catch such cases in the future.

Suggested-by: Aaron Lu <ziqianlu@...edance.com>
Reported-by: Matteo Martelli <matteo.martelli@...ethink.co.uk>
Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
Stress testing included running sched-messaging in a nested hierarchy
with various quotas set, alongside a continuous loop of cgroup creation
and deletion, as well as another loop continuously moving a busy-looping
task between cgroups.

No splats have been observed with this patch so far.

Aaron, Matteo,
I've not added any "Tested-by" tags since the final diff is slightly
different from the diff shared previously. The previous diff led to this
splat during my test with the newly added assert:
------------[ cut here ]------------
WARNING: CPU: 6 PID: 22912 at kernel/sched/fair.c:400 propagate_entity_cfs_rq+0x1f9/0x270
CPU: 6 UID: 0 PID: 22912 Comm: bash Not tainted 6.17.0-rc4-test+ #50 PREEMPT_{RT,(full)}
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:propagate_entity_cfs_rq+0x1f9/0x270
Code: ...
RSP: 0018:ffffd4a203713928 EFLAGS: 00010087
RAX: ffff8ea4714b1e00 RBX: 0000000000000000 RCX: 0000000000000001
RDX: ffff8ea4714b28e8 RSI: 0000003570150cf8 RDI: ffff8e65abc69400
RBP: ffff8ea4714b1f00 R08: 0000000000006160 R09: 0000000000016d5a
R10: 000000000000009c R11: 0000000000000000 R12: ffff8ea4714b1e00
R13: ffff8e6586df1a00 R14: 0000000000000001 R15: 0000000000000246
FS: 00007f73a6a7f740(0000) GS:ffff8e64d0689000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000562b9791d461 CR3: 000000010d875003 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
<TASK>
sched_move_task+0xc2/0x280
cpu_cgroup_attach+0x37/0x70
cgroup_migrate_execute+0x3b8/0x560
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
cgroup_attach_task+0x15b/0x200
? cgroup_attach_permissions+0x17e/0x1e0
__cgroup_procs_write+0x105/0x170
cgroup_procs_write+0x17/0x30
kernfs_fop_write_iter+0x127/0x1c0
vfs_write+0x305/0x420
ksys_write+0x6b/0xe0
do_syscall_64+0x85/0xb30
? __x64_sys_close+0x3d/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
? srso_alias_return_thunk+0x5/0xfbef5
? wp_page_reuse.constprop.0+0x7c/0x90
? srso_alias_return_thunk+0x5/0xfbef5
? do_wp_page+0x3df/0xd70
? set_close_on_exec+0x31/0x70
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
? srso_alias_return_thunk+0x5/0xfbef5
? do_fcntl+0x3f6/0x760
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? __handle_mm_fault+0x375/0x380
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? count_memcg_events+0xd9/0x1c0
? srso_alias_return_thunk+0x5/0xfbef5
? handle_mm_fault+0x1af/0x290
? srso_alias_return_thunk+0x5/0xfbef5
? do_user_addr_fault+0x2d0/0x8c0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f73a6b968f7
Code: ...
RSP: 002b:00007ffebcfdc688 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f73a6b968f7
RDX: 0000000000000006 RSI: 0000562b9791d460 RDI: 0000000000000001
RBP: 0000562b9791d460 R08: 00007f73a6c53460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
R13: 00007f73a6c9d780 R14: 00007f73a6c99600 R15: 00007f73a6c98a00
</TASK>
---[ end trace 0000000000000000 ]---

... which happens because we can have:

           A (*throttled; PELT is not frozen)
          / \
   (new) C   B (contains runnable tasks)

C is a new cgroup that is not enqueued on the cfs_rq of A yet. B is
another cfs_rq with tasks and is enqueued on the cfs_rq of A. The
entire hierarchy shown above is throttled.

When all tasks on B leave, the PELT of A and B is frozen, but since C
was never queued on A and A's PELT wasn't throttled when C was created,
C still has "pelt_clock_throttled" set to 0.

Next, propagate_entity_cfs_rq() on C sees that C does not have its PELT
clock throttled and adds it to "rq->tmp_alone_branch", but then sees
that A has its PELT clock throttled, which leads to the dangling
reference or, worse, incorrect ordering in the leaf cfs_rq list.

Changes are prepared on top of tip:sched/core at commit 45b7f780739a
("sched: Fix some typos in include/linux/preempt.h")
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18a30ae35441..9af82e214312 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,6 +6437,16 @@ static void sync_throttle(struct task_group *tg, int cpu)
 
 	cfs_rq->throttle_count = pcfs_rq->throttle_count;
 	cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
+
+	/*
+	 * It is not enough to sync the "pelt_clock_throttled" indicator
+	 * with the parent cfs_rq when the hierarchy is not queued.
+	 * Always join a throttled hierarchy with PELT clock throttled
+	 * and leave it to the first enqueue, or distribution to
+	 * unthrottle the PELT clock and add us on the leaf_cfs_rq_list.
+	 */
+	if (cfs_rq->throttle_count)
+		cfs_rq->pelt_clock_throttled = 1;
 }
 
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
@@ -13192,6 +13202,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
 			list_add_leaf_cfs_rq(cfs_rq);
 	}
+
+	assert_list_leaf_cfs_rq(rq_of(cfs_rq));
 }
 #else /* !CONFIG_FAIR_GROUP_SCHED: */
 static void propagate_entity_cfs_rq(struct sched_entity *se) { }

base-commit: 45b7f780739a3145aeef24d2dfa02517a6c82ed6
--
2.34.1