linux-kernel - Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs

Message-ID: <20250924113354.GA120@bytedance>
Date: Wed, 24 Sep 2025 19:33:54 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Matteo Martelli <matteo.martelli@...ethink.co.uk>
Cc: Valentin Schneider <vschneid@...hat.com>,
	Ben Segall <bsegall@...gle.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Peter Zijlstra <peterz@...radead.org>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Chuyi Zhou <zhouchuyi@...edance.com>,
	Jan Kiszka <jan.kiszka@...mens.com>,
	Florian Bezdeka <florian.bezdeka@...mens.com>,
	Songtang Liu <liusongtang@...edance.com>,
	Chen Yu <yu.c.chen@...el.com>,
	Michal Koutný <mkoutny@...e.com>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq

Hi Matteo,

On Tue, Sep 23, 2025 at 03:05:29PM +0200, Matteo Martelli wrote:
> Hi Aaron,
> 
> On Wed, 10 Sep 2025 17:50:41 +0800, Aaron Lu <ziqianlu@...edance.com> wrote:
> > Before task based throttle model, propagating load will stop at a
> > throttled cfs_rq and that propagate will happen on unthrottle time by
> > update_load_avg().
> > 
> > Now that there is no update_load_avg() on unthrottle for throttled
> > cfs_rq and all load tracking is done by task related operations, let the
> > propagate happen immediately.
> > 
> > While at it, add a comment to explain why cfs_rqs that are not affected
> > by throttle have to be added to leaf cfs_rq list in
> > propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
> > ("sched/fair: Fix unfairness caused by missing load decay").
> > 
> > Signed-off-by: Aaron Lu <ziqianlu@...edance.com>
> > ---
> 
> I have been testing again the patch set "[PATCH v4 0/5] Defer throttle
> when task exits to user" [1] together with these follow up patches. I
> found out that with this patch the kernel sometimes produces the warning
> WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); in
> assert_list_leaf_cfs_rq() called by enqueue_task_fair(). I could
> reproduce this systematically by applying both [1] and this patch on top
> of tag v6.17-rc6 and also by directly testing at commit fe8d238e646e
> from sched/core branch of tip tree. I couldn't reproduce the warning by
> testing at commit 5b726e9bf954 ("sched/fair: Get rid of
> throttled_lb_pair()").
>

Thanks a lot for the test.

> The test setup is the same used in my previous testing for v3 [2], where
> the CFS throttling events are mostly triggered by the first ssh logins
> into the system as the systemd user slice is configured with CPUQuota of
> 25%. Also note that the same systemd user slice is configured with CPU

I tried to replicate this setup. Below is my setup, using a 4-CPU VM
and an RT kernel at commit fe8d238e646e ("sched/fair: Propagate load for
throttled cfs_rq"):
# pwd
/sys/fs/cgroup/user.slice
# cat cpu.max
25000 100000
# cat cpuset.cpus
0

I then logged in over ssh as a normal user and could see throttling
happen, but couldn't hit this warning. Do you have to do something
special to trigger it?

> affinity set to only one core. I added some tracing to trace functions
> throttle_cfs_rq, tg_throttle_down, unthrottle_cfs_rq, tg_unthrottle_up,
> and it looks like the warning is triggered after the last unthrottle
> event, however I'm not sure the warning is actually related to the
> printed trace below or not. See the following logs that contains both
> the traced function events and the kernel warning.
> 
> [   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865040: throttle_cfs_rq <-pick_task_fair
> [   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865042: tg_throttle_down <-walk_tg_tree_from
> [   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865042: tg_throttle_down <-walk_tg_tree_from
> [   17.859264]  systemd-xdg-aut-1006    [000] dN.2.    17.865043: tg_throttle_down <-walk_tg_tree_from
> [   17.876999]        ktimers/0-15      [000] d.s13    17.882601: unthrottle_cfs_rq <-distribute_cfs_runtime
> [   17.876999]        ktimers/0-15      [000] d.s13    17.882603: tg_unthrottle_up <-walk_tg_tree_from
> [   17.876999]        ktimers/0-15      [000] d.s13    17.882605: tg_unthrottle_up <-walk_tg_tree_from
> [   17.876999]        ktimers/0-15      [000] d.s13    17.882605: tg_unthrottle_up <-walk_tg_tree_from
> [   17.910250]          systemd-999     [000] dN.2.    17.916019: throttle_cfs_rq <-put_prev_entity
> [   17.910250]          systemd-999     [000] dN.2.    17.916025: tg_throttle_down <-walk_tg_tree_from
> [   17.910250]          systemd-999     [000] dN.2.    17.916025: tg_throttle_down <-walk_tg_tree_from
> [   17.910250]          systemd-999     [000] dN.2.    17.916025: tg_throttle_down <-walk_tg_tree_from
> [   17.977245]        ktimers/0-15      [000] d.s13    17.982575: unthrottle_cfs_rq <-distribute_cfs_runtime
> [   17.977245]        ktimers/0-15      [000] d.s13    17.982578: tg_unthrottle_up <-walk_tg_tree_from
> [   17.977245]        ktimers/0-15      [000] d.s13    17.982579: tg_unthrottle_up <-walk_tg_tree_from
> [   17.977245]        ktimers/0-15      [000] d.s13    17.982580: tg_unthrottle_up <-walk_tg_tree_from
> [   18.009244]          systemd-999     [000] dN.2.    18.015030: throttle_cfs_rq <-pick_task_fair
> [   18.009244]          systemd-999     [000] dN.2.    18.015033: tg_throttle_down <-walk_tg_tree_from
> [   18.009244]          systemd-999     [000] dN.2.    18.015033: tg_throttle_down <-walk_tg_tree_from
> [   18.009244]          systemd-999     [000] dN.2.    18.015033: tg_throttle_down <-walk_tg_tree_from
> [   18.076822]        ktimers/0-15      [000] d.s13    18.082607: unthrottle_cfs_rq <-distribute_cfs_runtime
> [   18.076822]        ktimers/0-15      [000] d.s13    18.082609: tg_unthrottle_up <-walk_tg_tree_from
> [   18.076822]        ktimers/0-15      [000] d.s13    18.082611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.076822]        ktimers/0-15      [000] d.s13    18.082611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.109820]          systemd-999     [000] dN.2.    18.115604: throttle_cfs_rq <-put_prev_entity
> [   18.109820]          systemd-999     [000] dN.2.    18.115609: tg_throttle_down <-walk_tg_tree_from
> [   18.109820]          systemd-999     [000] dN.2.    18.115609: tg_throttle_down <-walk_tg_tree_from
> [   18.109820]          systemd-999     [000] dN.2.    18.115609: tg_throttle_down <-walk_tg_tree_from
> [   18.177167]        ktimers/0-15      [000] d.s13    18.182630: unthrottle_cfs_rq <-distribute_cfs_runtime
> [   18.177167]        ktimers/0-15      [000] d.s13    18.182632: tg_unthrottle_up <-walk_tg_tree_from
> [   18.177167]        ktimers/0-15      [000] d.s13    18.182633: tg_unthrottle_up <-walk_tg_tree_from
> [   18.177167]        ktimers/0-15      [000] d.s13    18.182634: tg_unthrottle_up <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226594: throttle_cfs_rq <-pick_task_fair
> [   18.220827]          systemd-999     [000] dN.2.    18.226597: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226597: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226597: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
> [   18.220827]          systemd-999     [000] dN.2.    18.226598: tg_throttle_down <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282606: unthrottle_cfs_rq <-distribute_cfs_runtime
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282608: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282610: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282610: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.276886]        ktimers/0-15      [000] d.s13    18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [   18.421349] ------------[ cut here ]------------
> [   18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980

I stared at the code and haven't been able to figure out when
enqueue_task_fair() would end up with a broken leaf cfs_rq list.

No matter what the culprit commit did, enqueue_task_fair() should always
get all the non-queued cfs_rqs on the list in a bottom up way. It should
either add the whole hierarchy to rq's leaf cfs_rq list, or stop at one
of the ancestor cfs_rqs which is already on the list. Either way, the
list should not be broken.
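
To make the expectation above concrete, here is a toy userspace model
of the bottom-up walk. It is only a sketch of my reading, not the
kernel code: struct toy_cfs_rq, branch_open (standing in for
"rq->tmp_alone_branch != &rq->leaf_cfs_rq_list"), toy_list_add_leaf(),
toy_enqueue() and toy_demo() are all made-up names, and the list itself
is reduced to an on_list flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a cfs_rq: only the fields this sketch needs. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;	/* NULL for the root cfs_rq */
	bool on_list;
};

/* Toy stand-in for "tmp_alone_branch points away from the list head". */
static bool branch_open;

/* Returns true once the branch being built is connected to the list. */
static bool toy_list_add_leaf(struct toy_cfs_rq *cfs_rq)
{
	if (cfs_rq->on_list)
		return !branch_open;

	cfs_rq->on_list = true;

	/* Root, or parent already on the list: the branch is connected. */
	if (!cfs_rq->parent || cfs_rq->parent->on_list) {
		branch_open = false;
		return true;
	}

	/* Parent not on the list yet: the branch stays open for now. */
	branch_open = true;
	return false;
}

/* Walk bottom up; returns true if no branch is left dangling. */
static bool toy_enqueue(struct toy_cfs_rq *leaf)
{
	for (struct toy_cfs_rq *q = leaf; q; q = q->parent) {
		if (toy_list_add_leaf(q))
			break;
	}
	return !branch_open;
}

/* Hypothetical demo: root <- a <- {b, c}. */
void toy_demo(void)
{
	struct toy_cfs_rq root = { .parent = NULL };
	struct toy_cfs_rq a = { .parent = &root };
	struct toy_cfs_rq b = { .parent = &a };
	struct toy_cfs_rq c = { .parent = &a };

	branch_open = false;
	assert(toy_enqueue(&b));	/* adds b, a and root */
	assert(toy_enqueue(&c));	/* adds c, stops at a */
	assert(root.on_list && a.on_list && b.on_list && c.on_list);
}
```

In this model a full walk always ends with the branch closed, whether
it reaches the root or stops at an ancestor that is already on the
list, which is why I don't see how enqueue_task_fair() alone would trip
the assert.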

> [   18.421355] Modules linked in: efivarfs
> [   18.421360] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc4-00010-gfe8d238e646e #2 PREEMPT_{RT,(full)}
> [   18.421362] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
> [   18.421364] RIP: 0010:enqueue_task_fair+0x925/0x980
> [   18.421366] Code: b5 48 01 00 00 49 89 95 48 01 00 00 49 89 bd 50 01 00 00 48 89 37 48 89 b0 70 0a 00 00 48 89 90 78 0a 00 00 e9 49 fa ff ff 90 <0f> 0b 90 e9 1c f9 ff ff 90 0f 0b 90 e9 59 fa ff ff 48 8b b0 88 0a
> [   18.421367] RSP: 0018:ffff9c7c8001fa20 EFLAGS: 00010087
> [   18.421369] RAX: ffff9358fdc29da8 RBX: 0000000000000003 RCX: ffff9358fdc29340
> [   18.421370] RDX: ffff935881a89000 RSI: 0000000000000000 RDI: 0000000000000003
> [   18.421371] RBP: ffff9358fdc293c0 R08: 0000000000000000 R09: 00000000b808a33f
> [   18.421371] R10: 0000000000200b20 R11: 0000000011659969 R12: 0000000000000001
> [   18.421372] R13: ffff93588214fe00 R14: 0000000000000000 R15: 0000000000200b20
> [   18.421375] FS:  00007fb07deddd80(0000) GS:ffff935945f6d000(0000) knlGS:0000000000000000
> [   18.421376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   18.421377] CR2: 00005571bafe12a0 CR3: 00000000024e6000 CR4: 00000000000006f0
> [   18.421377] Call Trace:
> [   18.421383]  <TASK>
> [   18.421387]  enqueue_task+0x31/0x70
> [   18.421389]  ttwu_do_activate+0x73/0x220
> [   18.421391]  try_to_wake_up+0x2b1/0x7a0
> [   18.421393]  ? kmem_cache_alloc_node_noprof+0x7f/0x210
> [   18.421396]  ep_autoremove_wake_function+0x12/0x40
> [   18.421400]  __wake_up_common+0x72/0xa0
> [   18.421402]  __wake_up_sync+0x38/0x50
> [   18.421404]  ep_poll_callback+0xd2/0x240
> [   18.421406]  __wake_up_common+0x72/0xa0
> [   18.421407]  __wake_up_sync_key+0x3f/0x60
> [   18.421409]  sock_def_readable+0x42/0xc0
> [   18.421414]  unix_dgram_sendmsg+0x48f/0x840
> [   18.421420]  ____sys_sendmsg+0x31c/0x350
> [   18.421423]  ___sys_sendmsg+0x99/0xe0
> [   18.421425]  __sys_sendmsg+0x8a/0xf0
> [   18.421429]  do_syscall_64+0xa4/0x260
> [   18.421434]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [   18.421438] RIP: 0033:0x7fb07e8d4d94
> [   18.421439] Code: 15 91 10 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d d5 92 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
> [   18.421440] RSP: 002b:00007ffff30e4d08 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
> [   18.421442] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb07e8d4d94
> [   18.421442] RDX: 0000000000004000 RSI: 00007ffff30e4e80 RDI: 0000000000000031
> [   18.421443] RBP: 00007ffff30e5ff0 R08: 00000000000000c0 R09: 0000000000000000
> [   18.421443] R10: 00007fb07deddc08 R11: 0000000000000202 R12: 00007ffff30e6070
> [   18.421444] R13: 00007ffff30e4f00 R14: 00007ffff30e4d10 R15: 000000000000000f
> [   18.421445]  </TASK>
> [   18.421446] ---[ end trace 0000000000000000 ]---
> 
> [1]: https://lore-kernel.gnuweeb.org/lkml/20250829081120.806-1-ziqianlu@bytedance.com/
> [2]: https://lore.kernel.org/lkml/d37fcac575ee94c3fe605e08e6297986@codethink.co.uk/
> 
> I hope this is helpful. I'm happy to provide more information or run
> additional tests if needed.

Yeah, definitely helpful, thanks.

While looking at this commit, I'm thinking maybe we shouldn't use
cfs_rq_pelt_clock_throttled() to decide if a cfs_rq should be added
to rq's leaf list. The reason is that a cfs_rq in a throttled hierarchy
can be removed from that leaf list in dequeue_entity() when it has no
entities left. So being on the list now doesn't mean it will still be
on the list at unthrottle time.

Considering that the purpose is to have the cfs_rq and its ancestors
added to the list in case this cfs_rq has some removed load that
needs to be decayed later, as described in commit 0258bdfaff5b
("sched/fair: Fix unfairness caused by missing load decay"), I'm
thinking maybe we should deal with cfs_rqs differently according to
whether they are in a throttled hierarchy or not:
- for a cfs_rq not in a throttled hierarchy, add it and its ancestors to
  the list so that the removed load can be decayed;
- for a cfs_rq in a throttled hierarchy, check at unthrottle time whether
  it has any removed load that needs to be decayed.
  The case in my mind is: a blocked task @p gets attached to a throttled
  cfs_rq by attaching a pid to a cgroup. Assume the cfs_rq was empty, with
  no tasks throttled or queued underneath it. Then @p is migrated to
  another cpu before being queued on it, so this cfs_rq now has some
  removed load on it. On unthrottle, this cfs_rq is considered fully
  decayed and isn't added to the leaf cfs_rq list. Then we have a problem.
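
The case above can be reduced to a toy userspace check. This is only a
sketch under my assumptions, not kernel code: struct toy_removed_rq,
other_checks_pass (standing in for everything cfs_rq_is_decayed()
already tests), removed_nr (standing in for cfs_rq->removed.nr),
toy_is_decayed() and toy_removed_demo() are made-up names.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the state the decay check looks at. */
struct toy_removed_rq {
	bool other_checks_pass;	/* assumption: existing checks see nothing */
	int removed_nr;		/* pending removed load entries */
};

/*
 * check_removed == false models the check without looking at removed
 * load; check_removed == true additionally refuses to call the cfs_rq
 * decayed while removed load is still pending.
 */
static bool toy_is_decayed(const struct toy_removed_rq *q, bool check_removed)
{
	if (!q->other_checks_pass)
		return false;

	if (check_removed && q->removed_nr)
		return false;

	return true;
}

/* Hypothetical demo: @p was attached and then migrated away. */
void toy_removed_demo(void)
{
	struct toy_removed_rq q = { .other_checks_pass = true, .removed_nr = 1 };

	/* Without the extra check the cfs_rq is skipped on unthrottle. */
	assert(toy_is_decayed(&q, false));

	/* With it, the cfs_rq is kept so the removed load can be decayed. */
	assert(!toy_is_decayed(&q, true));
}
```

Whether the existing checks really see nothing pending in that
situation is exactly the part I'm not sure about.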

With that said, I'm thinking of the diff below. No idea if this can
fix Matteo's problem though; it's just something I think can fix the
issue I described above, if I understand things correctly...

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993de30e1466..444f0eb2df71d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4062,6 +4062,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->removed.nr)
+		return false;
+
 	return true;
 }
 
@@ -13167,7 +13170,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 	 * change, make sure this cfs_rq stays on leaf cfs_rq list to have
 	 * that removed load decayed or it can cause faireness problem.
 	 */
-	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+	if (!throttled_hierarchy(cfs_rq))
 		list_add_leaf_cfs_rq(cfs_rq);
 
 	/* Start to propagate at parent */
@@ -13178,7 +13181,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 
-		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+		if (!throttled_hierarchy(cfs_rq))
 			list_add_leaf_cfs_rq(cfs_rq);
 	}
 }