linux-kernel - Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs

Message-ID: <db7fc090-5c12-450b-87a4-bcf06e10ef68@amd.com>
Date: Thu, 25 Sep 2025 13:47:35 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>, Matteo Martelli
	<matteo.martelli@...ethink.co.uk>
CC: Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
	Peter Zijlstra <peterz@...radead.org>, Chengming Zhou
	<chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>, Ingo Molnar
	<mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, Xi Wang
	<xii@...gle.com>, <linux-kernel@...r.kernel.org>, Juri Lelli
	<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, "Steven
 Rostedt" <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Chuyi Zhou
	<zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>, "Florian
 Bezdeka" <florian.bezdeka@...mens.com>, Songtang Liu
	<liusongtang@...edance.com>, Chen Yu <yu.c.chen@...el.com>,
	Michal Koutný <mkoutny@...e.com>, Sebastian Andrzej Siewior
	<bigeasy@...utronix.de>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq

Hello Aaron, Matteo,

On 9/24/2025 5:03 PM, Aaron Lu wrote:
>> [   18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
> 
> I stared at the code and haven't been able to figure out when
> enqueue_task_fair() would end up with a broken leaf cfs_rq list.

Yeah neither could I. I tried running with PREEMPT_RT too and still
couldn't trigger it :(

But I'm wondering if all we are missing is:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993de30e146..5f9e7b4df391 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
 
 	cfs_rq->throttle_count = pcfs_rq->throttle_count;
 	cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
+	cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
 }
 
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
---

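This is the only way we can currently have a break in the
cfs_rq_pelt_clock_throttled() hierarchy: without it, a cfs_rq going
through sync_throttle() does not inherit its parent's
pelt_clock_throttled state.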
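
To illustrate the suspected break, here is a minimal userspace sketch. It is
a toy model, not kernel code: struct toy_cfs_rq, toy_sync_throttle(),
toy_hierarchy_broken() and toy_new_child_breaks() are hypothetical names, and
"broken" is assumed to mean a cfs_rq whose PELT clock is running while an
ancestor's is throttled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct cfs_rq: only the fields discussed here. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;	/* NULL for the root */
	int throttle_count;
	bool pelt_clock_throttled;
};

/*
 * Models what sync_throttle() copies from the parent.
 * with_fix selects whether the proposed extra assignment is applied.
 */
static void toy_sync_throttle(struct toy_cfs_rq *cfs_rq, bool with_fix)
{
	struct toy_cfs_rq *pcfs_rq = cfs_rq->parent;

	if (!pcfs_rq)
		return;

	cfs_rq->throttle_count = pcfs_rq->throttle_count;
	if (with_fix)
		cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
}

/*
 * Assumed definition of a "break": this cfs_rq's PELT clock runs
 * while some ancestor's PELT clock is throttled.
 */
static bool toy_hierarchy_broken(const struct toy_cfs_rq *cfs_rq)
{
	const struct toy_cfs_rq *p;

	if (cfs_rq->pelt_clock_throttled)
		return false;

	for (p = cfs_rq->parent; p; p = p->parent)
		if (p->pelt_clock_throttled)
			return true;

	return false;
}

/* Sync a fresh child under a parent whose PELT clock is throttled. */
static bool toy_new_child_breaks(bool with_fix)
{
	struct toy_cfs_rq root = { NULL, 0, false };
	struct toy_cfs_rq parent = { &root, 1, true };
	struct toy_cfs_rq child = { &parent, 0, false };

	toy_sync_throttle(&child, with_fix);
	return toy_hierarchy_broken(&child);
}
```

In this model, toy_new_child_breaks(false) reports a break and
toy_new_child_breaks(true) does not. Whether the real warning comes from this
path is still unconfirmed.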

> 
> No matter what the culprit commit did, enqueue_task_fair() should always
> get all the non-queued cfs_rqs on the list in a bottom up way. It should
> either add the whole hierarchy to rq's leaf cfs_rq list, or stop at one
> of the ancestor cfs_rqs which is already on the list. Either way, the
> list should not be broken.
> 
>> [   18.421355] Modules linked in: efivarfs
>> [   18.421360] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc4-00010-gfe8d238e646e #2 PREEMPT_{RT,(full)}
>> [   18.421362] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
>> [   18.421364] RIP: 0010:enqueue_task_fair+0x925/0x980
>> [   18.421366] Code: b5 48 01 00 00 49 89 95 48 01 00 00 49 89 bd 50 01 00 00 48 89 37 48 89 b0 70 0a 00 00 48 89 90 78 0a 00 00 e9 49 fa ff ff 90 <0f> 0b 90 e9 1c f9 ff ff 90 0f 0b 90 e9 59 fa ff ff 48 8b b0 88 0a
>> [   18.421367] RSP: 0018:ffff9c7c8001fa20 EFLAGS: 00010087
>> [   18.421369] RAX: ffff9358fdc29da8 RBX: 0000000000000003 RCX: ffff9358fdc29340
>> [   18.421370] RDX: ffff935881a89000 RSI: 0000000000000000 RDI: 0000000000000003
>> [   18.421371] RBP: ffff9358fdc293c0 R08: 0000000000000000 R09: 00000000b808a33f
>> [   18.421371] R10: 0000000000200b20 R11: 0000000011659969 R12: 0000000000000001
>> [   18.421372] R13: ffff93588214fe00 R14: 0000000000000000 R15: 0000000000200b20
>> [   18.421375] FS:  00007fb07deddd80(0000) GS:ffff935945f6d000(0000) knlGS:0000000000000000
>> [   18.421376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   18.421377] CR2: 00005571bafe12a0 CR3: 00000000024e6000 CR4: 00000000000006f0
>> [   18.421377] Call Trace:
>> [   18.421383]  <TASK>
>> [   18.421387]  enqueue_task+0x31/0x70
>> [   18.421389]  ttwu_do_activate+0x73/0x220
>> [   18.421391]  try_to_wake_up+0x2b1/0x7a0
>> [   18.421393]  ? kmem_cache_alloc_node_noprof+0x7f/0x210
>> [   18.421396]  ep_autoremove_wake_function+0x12/0x40
>> [   18.421400]  __wake_up_common+0x72/0xa0
>> [   18.421402]  __wake_up_sync+0x38/0x50
>> [   18.421404]  ep_poll_callback+0xd2/0x240
>> [   18.421406]  __wake_up_common+0x72/0xa0
>> [   18.421407]  __wake_up_sync_key+0x3f/0x60
>> [   18.421409]  sock_def_readable+0x42/0xc0
>> [   18.421414]  unix_dgram_sendmsg+0x48f/0x840
>> [   18.421420]  ____sys_sendmsg+0x31c/0x350
>> [   18.421423]  ___sys_sendmsg+0x99/0xe0
>> [   18.421425]  __sys_sendmsg+0x8a/0xf0
>> [   18.421429]  do_syscall_64+0xa4/0x260
>> [   18.421434]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
>> [   18.421438] RIP: 0033:0x7fb07e8d4d94
>> [   18.421439] Code: 15 91 10 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d d5 92 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
>> [   18.421440] RSP: 002b:00007ffff30e4d08 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
>> [   18.421442] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb07e8d4d94
>> [   18.421442] RDX: 0000000000004000 RSI: 00007ffff30e4e80 RDI: 0000000000000031
>> [   18.421443] RBP: 00007ffff30e5ff0 R08: 00000000000000c0 R09: 0000000000000000
>> [   18.421443] R10: 00007fb07deddc08 R11: 0000000000000202 R12: 00007ffff30e6070
>> [   18.421444] R13: 00007ffff30e4f00 R14: 00007ffff30e4d10 R15: 000000000000000f
>> [   18.421445]  </TASK>
>> [   18.421446] ---[ end trace 0000000000000000 ]---
>>
>> [1]: https://lore-kernel.gnuweeb.org/lkml/20250829081120.806-1-ziqianlu@bytedance.com/
>> [2]: https://lore.kernel.org/lkml/d37fcac575ee94c3fe605e08e6297986@codethink.co.uk/
>>
>> I hope this is helpful. I'm happy to provide more information or run
>> additional tests if needed.
> 
> Yeah, definitely helpful, thanks.
> 
> While looking at this commit, I'm thinking maybe we shouldn't use
> cfs_rq_pelt_clock_throttled() to decide if a cfs_rq should be added
> to rq's leaf list. The reason is that a cfs_rq in a throttled
> hierarchy can be removed from that leaf list in dequeue_entity() when
> it has no entities left. So even if it's on the list now, that doesn't
> mean it will still be on the list at unthrottle time.
> 
> Considering that the purpose is to have the cfs_rq and its ancestors
> added to the list in case this cfs_rq has some removed load that
> needs to be decayed later, as described in commit 0258bdfaff5b
> ("sched/fair: Fix unfairness caused by missing load decay"), I'm
> thinking maybe we should deal with cfs_rqs differently depending on
> whether they are in a throttled hierarchy or not:
> - for cfs_rqs not in a throttled hierarchy, add the cfs_rq and its
>   ancestors to the list so that the removed load can be decayed;
> - for cfs_rqs in a throttled hierarchy, check at unthrottle time whether
>   the cfs_rq has any removed load that needs to be decayed.
>   The case I have in mind is: a blocked task @p gets attached to a
>   throttled cfs_rq by attaching a pid to a cgroup. Assume the cfs_rq was
>   empty, with no tasks throttled or queued underneath it. Then @p is
>   migrated to another cpu before being queued on it, so this cfs_rq now
>   has some removed load on it. On unthrottle, this cfs_rq is considered
>   fully decayed and isn't added to the leaf cfs_rq list. Then we have a
>   problem.
> 
> With the above said, I'm thinking of the diff below. No idea if it can
> fix Matteo's problem though; it's just something I think can fix the
> issue I described above, if I understand things correctly...
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f993de30e1466..444f0eb2df71d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4062,6 +4062,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  	if (child_cfs_rq_on_list(cfs_rq))
>  		return false;
>  
> +	if (cfs_rq->removed.nr)
> +		return false;

If load_avg_is_decayed(), then having removed load makes no difference,
right? We are not adding any weight to the tg, and the sum/avg cannot go
negative, so we are essentially removing nothing.

And update_load_avg() would propagate the removed load anyway, so does
this make a difference?
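
To spell out the "removing nothing" argument, here is a small userspace
sketch. It is a toy model under stated assumptions, not the kernel's code:
struct toy_load_rq and the toy_*() helpers are hypothetical, and
toy_sub_positive() only imitates the clamp-at-zero behaviour of the kernel's
sub_positive().

```c
#include <stdbool.h>

/* Toy stand-in for the load-tracking fields discussed here. */
struct toy_load_rq {
	unsigned long load_avg;		/* models cfs_rq->avg.load_avg */
	unsigned long removed_load_avg;	/* models cfs_rq->removed.load_avg */
	int removed_nr;			/* models cfs_rq->removed.nr */
};

/* Subtract, clamping at zero so the result can never go "negative". */
static unsigned long toy_sub_positive(unsigned long val, unsigned long sub)
{
	return val > sub ? val - sub : 0;
}

/* Fold any pending removed load into the average, then clear it. */
static void toy_fold_removed(struct toy_load_rq *rq)
{
	if (!rq->removed_nr)
		return;

	rq->load_avg = toy_sub_positive(rq->load_avg, rq->removed_load_avg);
	rq->removed_load_avg = 0;
	rq->removed_nr = 0;
}

/* Toy notion of "decayed": nothing is left in the average. */
static bool toy_load_avg_is_decayed(const struct toy_load_rq *rq)
{
	return rq->load_avg == 0;
}
```

In this model, folding removed load into an already decayed average leaves it
decayed, so the pending removal changes nothing observable. Only a non-zero
average is actually reduced.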

> +
>  	return true;
>  }
>  
> @@ -13167,7 +13170,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>  	 * change, make sure this cfs_rq stays on leaf cfs_rq list to have
>  	 * that removed load decayed or it can cause faireness problem.
>  	 */
> -	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> +	if (!throttled_hierarchy(cfs_rq))
>  		list_add_leaf_cfs_rq(cfs_rq);
>  
>  	/* Start to propagate at parent */
> @@ -13178,7 +13181,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>  
>  		update_load_avg(cfs_rq, se, UPDATE_TG);
>  
> -		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> +		if (!throttled_hierarchy(cfs_rq))
>  			list_add_leaf_cfs_rq(cfs_rq);
>  	}
>  }

-- 
Thanks and Regards,
Prateek