linux-kernel - Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs

Message-ID: <72706108-f1c3-4719-a65c-c7c5d76f9b1e@amd.com>
Date: Thu, 25 Sep 2025 16:52:25 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>
CC: Matteo Martelli <matteo.martelli@...ethink.co.uk>, Valentin Schneider
	<vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>, Peter Zijlstra
	<peterz@...radead.org>, Chengming Zhou <chengming.zhou@...ux.dev>, Josh Don
	<joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Xi Wang <xii@...gle.com>,
	<linux-kernel@...r.kernel.org>, Juri Lelli <juri.lelli@...hat.com>, "Dietmar
 Eggemann" <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
	Mel Gorman <mgorman@...e.de>, Chuyi Zhou <zhouchuyi@...edance.com>, "Jan
 Kiszka" <jan.kiszka@...mens.com>, Florian Bezdeka
	<florian.bezdeka@...mens.com>, Songtang Liu <liusongtang@...edance.com>,
	"Chen Yu" <yu.c.chen@...el.com>, Michal Koutný
	<mkoutny@...e.com>, Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq



On 9/25/2025 2:59 PM, Aaron Lu wrote:
> Hi Prateek,
> 
> On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
>> Hello Aaron, Matteo,
>>
>> On 9/24/2025 5:03 PM, Aaron Lu wrote:
>>>> [   18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
>>>
>>> I stared at the code and haven't been able to figure out when
>>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
>>
>> Yeah neither could I. I tried running with PREEMPT_RT too and still
>> couldn't trigger it :(
>>
>> But I'm wondering if all we are missing is:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f993de30e146..5f9e7b4df391 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
>>  
>>  	cfs_rq->throttle_count = pcfs_rq->throttle_count;
>>  	cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
>> +	cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
>>  }
>>  
>>  /* conditionally throttle active cfs_rq's from put_prev_entity() */
>> ---
>>
>> This is the only way we can currently have a break in
>> cfs_rq_pelt_clock_throttled() hierarchy.
>>
> 
> Great finding! Yes, that is missed.
> 
> With this info, I was able to trigger the assert in
> enqueue_task_fair(). The stack is different from Matteo's: his stack is
> from the ttwu path while mine is from the exit path. Anyway, let me do
> more analysis and get back to you:
> 
> [   67.041905] ------------[ cut here ]------------
> [   67.042387] WARNING: CPU: 2 PID: 11582 at kernel/sched/fair.c:401 enqueue_task_fair+0x6db/0x720
> [   67.043227] Modules linked in:
> [   67.043537] CPU: 2 UID: 0 PID: 11582 Comm: sudo Tainted: G        W           6.17.0-rc4-00010-gfe8d238e646e-dirty #72 PREEMPT(voluntary)
> [   67.044694] Tainted: [W]=WARN
> [   67.044997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [   67.045910] RIP: 0010:enqueue_task_fair+0x6db/0x720
> [   67.046383] Code: 00 48 c7 c7 96 b2 60 82 c6 05 af 64 2e 05 01 e8 fb 12 03 00 8b 4c 24 04 e9 f8 fc ff ff 4c 89 ef e8 ea a2 ff ff e9 ad fa ff ff <0f> 0b e9 5d fc ff ff 49 8b b4 24 08 0b 00 00 4c 89 e7 e8 de 31 01
> [   67.048155] RSP: 0018:ffa000002d2a7dc0 EFLAGS: 00010087
> [   67.048655] RAX: ff11000ff05fd2e8 RBX: 0000000000000000 RCX: 0000000000000004
> [   67.049339] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000ff05fd1f0
> [   67.050036] RBP: 0000000000000001 R08: 0000000000000000 R09: ff11000ff05fc908
> [   67.050731] R10: 0000000000000000 R11: 00000000fa83b2da R12: ff11000ff05fc800
> [   67.051402] R13: 0000000000000000 R14: 00000000002ab980 R15: ff11000ff05fc8c0
> [   67.052083] FS:  0000000000000000(0000) GS:ff110010696a6000(0000) knlGS:0000000000000000
> [   67.052855] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   67.053404] CR2: 00007f67f8b96168 CR3: 0000000002c3c006 CR4: 0000000000371ef0
> [   67.054083] Call Trace:
> [   67.054334]  <TASK>
> [   67.054546]  enqueue_task+0x35/0xd0
> [   67.054885]  sched_move_task+0x291/0x370
> [   67.055268]  ? kmem_cache_free+0x2d9/0x480
> [   67.055669]  do_exit+0x204/0x4f0
> [   67.055984]  ? lock_release+0x10a/0x170
> [   67.056356]  do_group_exit+0x36/0xa0
> [   67.056714]  __x64_sys_exit_group+0x18/0x20
> [   67.057121]  x64_sys_call+0x14fa/0x1720
> [   67.057502]  do_syscall_64+0x6a/0x2d0
> [   67.057865]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Great! I'll try stressing this path too.
P.S. Are you seeing this with sync_throttle() fix too?
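
For anyone following along, here is a toy userspace model of the break
discussed above. It is only a sketch: struct toy_cfs_rq, toy_sync_throttle()
and toy_hierarchy_has_break() are hypothetical names, not kernel code, and
the "break" is assumed to mean an ancestor whose PELT clock is frozen while
a descendant's is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct cfs_rq; only the fields discussed in this thread. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;	/* NULL for the root */
	int throttle_count;
	bool pelt_clock_throttled;
};

/*
 * Shaped like sync_throttle() in the diff above. @copy_pelt_state selects
 * whether the extra assignment proposed in the diff is applied.
 */
void toy_sync_throttle(struct toy_cfs_rq *cfs_rq, bool copy_pelt_state)
{
	struct toy_cfs_rq *pcfs_rq = cfs_rq->parent;

	if (!pcfs_rq)
		return;

	cfs_rq->throttle_count = pcfs_rq->throttle_count;
	if (copy_pelt_state)
		cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
}

/*
 * Assumed invariant: walking up from @cfs_rq, no ancestor has a frozen PELT
 * clock while the level below it is still running its clock.
 */
bool toy_hierarchy_has_break(const struct toy_cfs_rq *cfs_rq)
{
	for (; cfs_rq->parent; cfs_rq = cfs_rq->parent) {
		if (cfs_rq->parent->pelt_clock_throttled &&
		    !cfs_rq->pelt_clock_throttled)
			return true;
	}
	return false;
}

/* Usage check: a new child under a frozen parent, without and with the fix. */
void toy_sync_throttle_demo(void)
{
	struct toy_cfs_rq root = { .parent = NULL, .throttle_count = 1,
				   .pelt_clock_throttled = true };
	struct toy_cfs_rq child = { .parent = &root };

	toy_sync_throttle(&child, false);
	assert(toy_hierarchy_has_break(&child));

	toy_sync_throttle(&child, true);
	assert(!toy_hierarchy_has_break(&child));
}
```

In this model the child only starts out consistent with its parent when
the pelt_clock_throttled copy is made, which is the gap the one-line diff
is meant to close.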

[..snip..]

>>
>> If load_avg_is_decayed(), then having removed load makes no difference
>> right? We are not adding any weight to the tg and the sum/avg cannot go
>> negative so we are essentially removing nothing.
>>
>> And, update_load_avg() would propagate the removed load anyways so does
>> this make a difference?
>>
> 
> You are right. I misunderstood the meaning of removed load: I thought
> the load was transferred to the removed part, but actually the load is
> still there in the cfs_rq when a task migrates away.
> 
> Having a positive removed.nr but a fully decayed load avg looks strange
> to me. Maybe we can avoid this with something like the change below; it
> should save some cycles by not taking the lock and not having to deal
> with zero removed load later in update_cfs_rq_load_avg(). Just a thought.
> (I have a vague memory that util_avg and runnable_avg should always be
> smaller than load_avg; if so, we can simplify the condition to check
> only load_avg.)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f993de30e1466..130db255a1ef6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4788,6 +4808,10 @@ static void remove_entity_load_avg(struct sched_entity *se)
>  
>  	sync_entity_load_avg(se);
>  
> +	/* It's possible this entity has no load left after sync */
> +	if (!se->avg.util_avg && !se->avg.load_avg && !se->avg.runnable_avg)
> +		return;
> +

This makes sense. Maybe we can rename the current "load_avg_is_decayed()"
to "load_sum_is_decayed()" and extract the condition from the
WARN_ON_ONCE() in it to "load_avg_is_decayed()" and use it here.
Thoughts?

P.S. There is this other patch that also touches this bit
https://lore.kernel.org/lkml/20250910084316.356169-1-hupu.gm@gmail.com/
Maybe we can use load_avg_is_decayed() itself here.

>  	raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags);
>  	++cfs_rq->removed.nr;
>  	cfs_rq->removed.util_avg	+= se->avg.util_avg;
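
To make the rename suggestion concrete, here is a rough userspace sketch
of how the two helpers could be split and used for the early return. All
toy_* names are hypothetical, the structs carry only the fields that show
up in this thread, and the *_sum checks are my assumption about what the
current helper tests before its WARN_ON_ONCE(); assert() stands in for the
warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct sched_avg; field set is an assumption. */
struct toy_sched_avg {
	uint64_t load_sum;
	uint64_t runnable_sum;
	uint32_t util_sum;
	unsigned long load_avg;
	unsigned long runnable_avg;
	unsigned long util_avg;
};

/* Toy stand-in for cfs_rq->removed, minus the lock. */
struct toy_removed {
	int nr;
	unsigned long util_avg;
	unsigned long load_avg;
	unsigned long runnable_avg;
};

/* The condition extracted from the WARN_ON_ONCE(): all averages are zero. */
bool toy_load_avg_is_decayed(const struct toy_sched_avg *sa)
{
	return !sa->load_avg && !sa->util_avg && !sa->runnable_avg;
}

/* What the current helper would become after the rename. */
bool toy_load_sum_is_decayed(const struct toy_sched_avg *sa)
{
	if (sa->load_sum || sa->util_sum || sa->runnable_sum)
		return false;

	/* Zero sums are expected to imply zero averages. */
	assert(toy_load_avg_is_decayed(sa));
	return true;
}

/*
 * Early return modelled on the quoted diff. Returns true if load was
 * recorded as removed, false if the entity had nothing left to remove.
 */
bool toy_remove_entity_load_avg(struct toy_removed *removed,
				const struct toy_sched_avg *avg)
{
	if (toy_load_avg_is_decayed(avg))
		return false;

	/* The kernel updates these under cfs_rq->removed.lock; omitted here. */
	++removed->nr;
	removed->util_avg += avg->util_avg;
	removed->load_avg += avg->load_avg;
	removed->runnable_avg += avg->runnable_avg;
	return true;
}
```

With this split, a fully decayed entity never bumps removed.nr, so the
case of a positive removed.nr with nothing to remove does not come up in
the model.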

-- 
Thanks and Regards,
Prateek