[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <db7fc090-5c12-450b-87a4-bcf06e10ef68@amd.com>
Date: Thu, 25 Sep 2025 13:47:35 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>, Matteo Martelli
<matteo.martelli@...ethink.co.uk>
CC: Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
Peter Zijlstra <peterz@...radead.org>, Chengming Zhou
<chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>, Ingo Molnar
<mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, Xi Wang
<xii@...gle.com>, <linux-kernel@...r.kernel.org>, Juri Lelli
<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, "Steven
Rostedt" <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Chuyi Zhou
<zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>, "Florian
Bezdeka" <florian.bezdeka@...mens.com>, Songtang Liu
<liusongtang@...edance.com>, Chen Yu <yu.c.chen@...el.com>,
Michal Koutný <mkoutny@...e.com>, Sebastian Andrzej Siewior
<bigeasy@...utronix.de>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
Hello Aaron, Matteo,
On 9/24/2025 5:03 PM, Aaron Lu wrote:
>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
>
> I stared at the code and haven't been able to figure out when
> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
Yeah neither could I. I tried running with PREEMPT_RT too and still
couldn't trigger it :(
But I'm wondering if all we are missing is:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993de30e146..5f9e7b4df391 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
cfs_rq->throttle_count = pcfs_rq->throttle_count;
cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
+ cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
}
/* conditionally throttle active cfs_rq's from put_prev_entity() */
---
This is the only way we can currently have a break in
cfs_rq_pelt_clock_throttled() hierarchy.
>
> No matter what the culprit commit did, enqueue_task_fair() should always
> get all the non-queued cfs_rqs on the list in a bottom up way. It should
> either add the whole hierarchy to rq's leaf cfs_rq list, or stop at one
> of the ancestor cfs_rqs which is already on the list. Either way, the
> list should not be broken.
>
>> [ 18.421355] Modules linked in: efivarfs
>> [ 18.421360] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc4-00010-gfe8d238e646e #2 PREEMPT_{RT,(full)}
>> [ 18.421362] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
>> [ 18.421364] RIP: 0010:enqueue_task_fair+0x925/0x980
>> [ 18.421366] Code: b5 48 01 00 00 49 89 95 48 01 00 00 49 89 bd 50 01 00 00 48 89 37 48 89 b0 70 0a 00 00 48 89 90 78 0a 00 00 e9 49 fa ff ff 90 <0f> 0b 90 e9 1c f9 ff ff 90 0f 0b 90 e9 59 fa ff ff 48 8b b0 88 0a
>> [ 18.421367] RSP: 0018:ffff9c7c8001fa20 EFLAGS: 00010087
>> [ 18.421369] RAX: ffff9358fdc29da8 RBX: 0000000000000003 RCX: ffff9358fdc29340
>> [ 18.421370] RDX: ffff935881a89000 RSI: 0000000000000000 RDI: 0000000000000003
>> [ 18.421371] RBP: ffff9358fdc293c0 R08: 0000000000000000 R09: 00000000b808a33f
>> [ 18.421371] R10: 0000000000200b20 R11: 0000000011659969 R12: 0000000000000001
>> [ 18.421372] R13: ffff93588214fe00 R14: 0000000000000000 R15: 0000000000200b20
>> [ 18.421375] FS: 00007fb07deddd80(0000) GS:ffff935945f6d000(0000) knlGS:0000000000000000
>> [ 18.421376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 18.421377] CR2: 00005571bafe12a0 CR3: 00000000024e6000 CR4: 00000000000006f0
>> [ 18.421377] Call Trace:
>> [ 18.421383] <TASK>
>> [ 18.421387] enqueue_task+0x31/0x70
>> [ 18.421389] ttwu_do_activate+0x73/0x220
>> [ 18.421391] try_to_wake_up+0x2b1/0x7a0
>> [ 18.421393] ? kmem_cache_alloc_node_noprof+0x7f/0x210
>> [ 18.421396] ep_autoremove_wake_function+0x12/0x40
>> [ 18.421400] __wake_up_common+0x72/0xa0
>> [ 18.421402] __wake_up_sync+0x38/0x50
>> [ 18.421404] ep_poll_callback+0xd2/0x240
>> [ 18.421406] __wake_up_common+0x72/0xa0
>> [ 18.421407] __wake_up_sync_key+0x3f/0x60
>> [ 18.421409] sock_def_readable+0x42/0xc0
>> [ 18.421414] unix_dgram_sendmsg+0x48f/0x840
>> [ 18.421420] ____sys_sendmsg+0x31c/0x350
>> [ 18.421423] ___sys_sendmsg+0x99/0xe0
>> [ 18.421425] __sys_sendmsg+0x8a/0xf0
>> [ 18.421429] do_syscall_64+0xa4/0x260
>> [ 18.421434] entry_SYSCALL_64_after_hwframe+0x77/0x7f
>> [ 18.421438] RIP: 0033:0x7fb07e8d4d94
>> [ 18.421439] Code: 15 91 10 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d d5 92 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
>> [ 18.421440] RSP: 002b:00007ffff30e4d08 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
>> [ 18.421442] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb07e8d4d94
>> [ 18.421442] RDX: 0000000000004000 RSI: 00007ffff30e4e80 RDI: 0000000000000031
>> [ 18.421443] RBP: 00007ffff30e5ff0 R08: 00000000000000c0 R09: 0000000000000000
>> [ 18.421443] R10: 00007fb07deddc08 R11: 0000000000000202 R12: 00007ffff30e6070
>> [ 18.421444] R13: 00007ffff30e4f00 R14: 00007ffff30e4d10 R15: 000000000000000f
>> [ 18.421445] </TASK>
>> [ 18.421446] ---[ end trace 0000000000000000 ]---
>>
>> [1]: https://lore-kernel.gnuweeb.org/lkml/20250829081120.806-1-ziqianlu@bytedance.com/
>> [2]: https://lore.kernel.org/lkml/d37fcac575ee94c3fe605e08e6297986@codethink.co.uk/
>>
>> I hope this is helpful. I'm happy to provide more information or run
>> additional tests if needed.
>
> Yeah, definitely helpful, thanks.
>
> While looking at this commit, I'm thinking maybe we shouldn't use
> cfs_rq_pelt_clock_throttled() to decide if cfs_rq should be added
> to rq's leaf list. The reason is, for a cfs_rq that is in throttled
> hierarchy, it can be removed from that leaf list when it has no entities
> left in dequeue_entity(). So even when it's on the list now doesn't
> mean it will still be on the list at unthrottle time.
>
> Considering that the purpose is to have cfs_rq and its ancestors to be
> added to the list in case this cfs_rq may have some removed load that
> needs to be decayed later as described in commit 0258bdfaff5b("sched/fair:
> Fix unfairness caused by missing load decay"), I'm thinking maybe we
> should deal with cfs_rqs differently according to whether it is in
> throttled hierarchy or not:
> - for cfs_rqs not in throttled hierarchy, add it and its ancestors to
> the list so that the removed load can be decayed;
> - for cfs_rqs in throttled hierarchy, check on unthrottle time whether
> it has any removed load that needs to be decayed.
> The case in my mind is: an blocked task @p gets attached to a throttled
> cfs_rq by attaching a pid to a cgroup. Assume the cfs_rq was empty, had
> no tasks throttled or queued underneath it. Then @p is migrated to
> another cpu before being queued on it, so this cfs_rq now has some
> removed load on it. On unthrottle, this cfs_rq is considered fully
> decayed and isn't added to leaf cfs_rq list. Then we have a problem.
>
> With the above said, I'm thinking the below diff. No idea if this can
> fix Matteo's problem though, it's just something I think can fix the
> issue I described above, if I understand things correctly...
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f993de30e1466..444f0eb2df71d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4062,6 +4062,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> if (child_cfs_rq_on_list(cfs_rq))
> return false;
>
> + if (cfs_rq->removed.nr)
> + return false;
If load_avg_is_decayed(), then having removed load makes no difference
right? We are not adding any weight to the tg and the sum/avg cannot go
negative so we are essentially removing nothing.
And, update_load_avg() would propagate the removed load anyways so does
this make a difference?
> +
> return true;
> }
>
> @@ -13167,7 +13170,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> * change, make sure this cfs_rq stays on leaf cfs_rq list to have
> * that removed load decayed or it can cause faireness problem.
> */
> - if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> + if (!throttled_hierarchy(cfs_rq))
> list_add_leaf_cfs_rq(cfs_rq);
>
> /* Start to propagate at parent */
> @@ -13178,7 +13181,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
>
> - if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> + if (!throttled_hierarchy(cfs_rq))
> list_add_leaf_cfs_rq(cfs_rq);
> }
> }
--
Thanks and Regards,
Prateek
Powered by blists - more mailing lists