Message-ID: <72706108-f1c3-4719-a65c-c7c5d76f9b1e@amd.com>
Date: Thu, 25 Sep 2025 16:52:25 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>
CC: Matteo Martelli <matteo.martelli@...ethink.co.uk>, Valentin Schneider
<vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>, Peter Zijlstra
<peterz@...radead.org>, Chengming Zhou <chengming.zhou@...ux.dev>, Josh Don
<joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Xi Wang <xii@...gle.com>,
<linux-kernel@...r.kernel.org>, Juri Lelli <juri.lelli@...hat.com>, "Dietmar
Eggemann" <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Mel Gorman <mgorman@...e.de>, Chuyi Zhou <zhouchuyi@...edance.com>, "Jan
Kiszka" <jan.kiszka@...mens.com>, Florian Bezdeka
<florian.bezdeka@...mens.com>, Songtang Liu <liusongtang@...edance.com>,
"Chen Yu" <yu.c.chen@...el.com>, Michal Koutný
<mkoutny@...e.com>, Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
On 9/25/2025 2:59 PM, Aaron Lu wrote:
> Hi Prateek,
>
> On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
>> Hello Aaron, Matteo,
>>
>> On 9/24/2025 5:03 PM, Aaron Lu wrote:
>>>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
>>>
>>> I stared at the code and haven't been able to figure out when
>>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
>>
>> Yeah neither could I. I tried running with PREEMPT_RT too and still
>> couldn't trigger it :(
>>
>> But I'm wondering if all we are missing is:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f993de30e146..5f9e7b4df391 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
>>
>>  	cfs_rq->throttle_count = pcfs_rq->throttle_count;
>>  	cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
>> +	cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
>> }
>>
>> /* conditionally throttle active cfs_rq's from put_prev_entity() */
>> ---
>>
>> This is the only way we can currently have a break in
>> cfs_rq_pelt_clock_throttled() hierarchy.
>>
>
> Great finding! Yes, that is missed.
>
> According to this info, I'm able to trigger the assert in
> enqueue_task_fair(). The stack is different from Matteo's: his stack is
> from ttwu path while mine is from exit. Anyway, let me do more analysis
> and get back to you:
>
> [ 67.041905] ------------[ cut here ]------------
> [ 67.042387] WARNING: CPU: 2 PID: 11582 at kernel/sched/fair.c:401 enqueue_task_fair+0x6db/0x720
> [ 67.043227] Modules linked in:
> [ 67.043537] CPU: 2 UID: 0 PID: 11582 Comm: sudo Tainted: G W 6.17.0-rc4-00010-gfe8d238e646e-dirty #72 PREEMPT(voluntary)
> [ 67.044694] Tainted: [W]=WARN
> [ 67.044997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 67.045910] RIP: 0010:enqueue_task_fair+0x6db/0x720
> [ 67.046383] Code: 00 48 c7 c7 96 b2 60 82 c6 05 af 64 2e 05 01 e8 fb 12 03 00 8b 4c 24 04 e9 f8 fc ff ff 4c 89 ef e8 ea a2 ff ff e9 ad fa ff ff <0f> 0b e9 5d fc ff ff 49 8b b4 24 08 0b 00 00 4c 89 e7 e8 de 31 01
> [ 67.048155] RSP: 0018:ffa000002d2a7dc0 EFLAGS: 00010087
> [ 67.048655] RAX: ff11000ff05fd2e8 RBX: 0000000000000000 RCX: 0000000000000004
> [ 67.049339] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000ff05fd1f0
> [ 67.050036] RBP: 0000000000000001 R08: 0000000000000000 R09: ff11000ff05fc908
> [ 67.050731] R10: 0000000000000000 R11: 00000000fa83b2da R12: ff11000ff05fc800
> [ 67.051402] R13: 0000000000000000 R14: 00000000002ab980 R15: ff11000ff05fc8c0
> [ 67.052083] FS: 0000000000000000(0000) GS:ff110010696a6000(0000) knlGS:0000000000000000
> [ 67.052855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 67.053404] CR2: 00007f67f8b96168 CR3: 0000000002c3c006 CR4: 0000000000371ef0
> [ 67.054083] Call Trace:
> [ 67.054334] <TASK>
> [ 67.054546] enqueue_task+0x35/0xd0
> [ 67.054885] sched_move_task+0x291/0x370
> [ 67.055268] ? kmem_cache_free+0x2d9/0x480
> [ 67.055669] do_exit+0x204/0x4f0
> [ 67.055984] ? lock_release+0x10a/0x170
> [ 67.056356] do_group_exit+0x36/0xa0
> [ 67.056714] __x64_sys_exit_group+0x18/0x20
> [ 67.057121] x64_sys_call+0x14fa/0x1720
> [ 67.057502] do_syscall_64+0x6a/0x2d0
> [ 67.057865] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Great! I'll try stressing this path too.
P.S. Are you seeing this with sync_throttle() fix too?
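For reference, a minimal userspace sketch of what the one-line sync_throttle() change quoted above does: a child cfs_rq that is synced against its parent also inherits the parent's pelt_clock_throttled state. The names ending in _mock are hypothetical stand-ins rather than kernel code, and the PELT clock is passed in as a plain argument instead of being read via rq_clock_pelt().

```c
#include <assert.h>

/* Hypothetical stand-in holding only the cfs_rq fields touched by the diff. */
struct cfs_rq_mock {
	int throttle_count;
	unsigned long long throttled_clock_pelt;
	int pelt_clock_throttled;
};

/*
 * Mock of sync_throttle() with the proposed line added: the child copies
 * the parent's throttle_count and pelt_clock_throttled, and records the
 * current PELT clock (an assumed plain argument here).
 */
static void sync_throttle_mock(struct cfs_rq_mock *cfs_rq,
			       const struct cfs_rq_mock *pcfs_rq,
			       unsigned long long now_pelt)
{
	cfs_rq->throttle_count = pcfs_rq->throttle_count;
	cfs_rq->throttled_clock_pelt = now_pelt;
	/* The line the diff adds: */
	cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
}
```

Without the last assignment the child would keep pelt_clock_throttled at 0 even when the parent has it set, which is the break in the hierarchy described above.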
[..snip..]
>>
>> If load_avg_is_decayed(), then having removed load makes no difference
>> right? We are not adding any weight to the tg and the sum/avg cannot go
>> negative so we are essentially removing nothing.
>>
>> And, update_load_avg() would propagate the removed load anyways so does
>> this make a difference?
>>
>
> You are right. I misunderstood the meaning of removed load: I thought
> the load was transferred to the removed part, but the load actually
> stays in the cfs_rq when a task migrates away.
>
> Having a positive removed.nr but fully decayed load avg looks strange
> to me. Maybe we can avoid this with something like the change below; it
> should save some cycles by avoiding taking a lock and later dealing
> with zero removed load in update_cfs_rq_load_avg(). Just a thought:
> (I had a vague memory that util_avg and runnable_avg should always be
> smaller than load_avg, if so, we can simplify the condition by just
> checking load_avg)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f993de30e1466..130db255a1ef6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4788,6 +4808,10 @@ static void remove_entity_load_avg(struct sched_entity *se)
>
>  	sync_entity_load_avg(se);
>
> +	/* It's possible this entity has no load left after sync */
> +	if (!se->avg.util_avg && !se->avg.load_avg && !se->avg.runnable_avg)
> +		return;
> +
This makes sense. Maybe we can rename the current "load_avg_is_decayed()"
to "load_sum_is_decayed()" and extract the condition from the
WARN_ON_ONCE() in it to "load_avg_is_decayed()" and use it here.
Thoughts?
P.S. There is this other patch that also touches this bit
https://lore.kernel.org/lkml/20250910084316.356169-1-hupu.gm@gmail.com/
Maybe we can use load_avg_is_decayed() itself here.
>  	raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags);
>  	++cfs_rq->removed.nr;
>  	cfs_rq->removed.util_avg += se->avg.util_avg;
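
The helper split suggested above could look roughly like the userspace sketch below. This is an illustration under assumptions, not a tested kernel patch: struct sched_avg_mock is a hypothetical stand-in for the PELT fields of struct sched_avg, assert() stands in for the kernel's WARN_ON_ONCE(), and entity_has_load_to_remove() is a made-up name for the early-return check in remove_entity_load_avg().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PELT fields of the kernel's struct sched_avg. */
struct sched_avg_mock {
	unsigned long long load_sum;
	unsigned long long runnable_sum;
	unsigned int util_sum;
	unsigned long load_avg;
	unsigned long runnable_avg;
	unsigned long util_avg;
};

/* Proposed new meaning: true when every _avg field has decayed to zero. */
static inline bool load_avg_is_decayed(const struct sched_avg_mock *sa)
{
	return !sa->load_avg && !sa->util_avg && !sa->runnable_avg;
}

/* Proposed rename of the current helper: true when every _sum is zero. */
static inline bool load_sum_is_decayed(const struct sched_avg_mock *sa)
{
	if (sa->load_sum || sa->util_sum || sa->runnable_sum)
		return false;

	/*
	 * _avg = _sum / divider, so zero sums must imply zero averages.
	 * The kernel would use WARN_ON_ONCE() here; assert() is a stand-in.
	 */
	assert(load_avg_is_decayed(sa));
	return true;
}

/*
 * Mock of the early-return condition in remove_entity_load_avg():
 * only queue removed load when the entity still has some average left.
 */
static inline bool entity_has_load_to_remove(const struct sched_avg_mock *sa)
{
	return !load_avg_is_decayed(sa);
}
```

Because the averages are integer quotients of the sums, an entity can have a nonzero _sum and still have all-zero averages; in that case the sketch skips the removal, which matches the check in the quoted diff.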
--
Thanks and Regards,
Prateek