Message-ID: <5b947108-9cc0-4b33-abb2-482825756911@amd.com>
Date: Fri, 26 Sep 2025 15:41:22 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>, Peter Zijlstra <peterz@...radead.org>
CC: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>, Matteo Martelli
	<matteo.martelli@...ethink.co.uk>, <linux-kernel@...r.kernel.org>, "Dietmar
 Eggemann" <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, "Valentin
 Schneider" <vschneid@...hat.com>, Chengming Zhou <chengming.zhou@...ux.dev>
Subject: Re: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with
 PELT clock throttled

(+Chengming for the discussion on the PELT bits)

Hello Aaron,

On 9/26/2025 3:08 PM, Aaron Lu wrote:
>> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
>> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
>> dequeue on other subtree on the throttled hierarchy can freeze the PELT
>> clock for the parent hierarchy without setting the indicators for this
>> newly added cfs_rq which was never enqueued.
>>
> 
> Sigh...

I had the exact reaction when I managed to trigger that again (T_T)

> 
>> Since there are no tasks on the new hierarchy, start a cfs_rq on a
>> throttled hierarchy with its PELT clock throttled. The first enqueue, or
>> the distribution (whichever happens first) will unfreeze the PELT clock
>> and queue the cfs_rq on the leaf cfs_rq list.
>>
> 
> Makes sense.
> 
>> While at it, add an assert_list_leaf_cfs_rq() in
>> propagate_entity_cfs_rq() to catch such cases in the future.
>>
>> Suggested-by: Aaron Lu <ziqianlu@...edance.com>
>> Reported-by: Matteo Martelli <matteo.martelli@...ethink.co.uk>
>> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
>> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
> 
> Should be Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle
> model")? "Task based throttle time accounting" doesn't touch pelt bits.

True that the issue is seen after e1fad12dcb66, but I thought this could
be considered a fix for the initial introduction of the indicator. I
don't have any strong feelings on this.

Peter, can you update the commit in the fixes tag if and when you pick
this?

> 
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
> 
> Reviewed-by: Aaron Lu <ziqianlu@...edance.com>
> Tested-by: Aaron Lu <ziqianlu@...edance.com>
> 
> Thanks for the fix.

Thank you for helping with the debug and testing :)

> 
> BTW, I'm thinking in propagate_entity_cfs_rq(), we shouldn't check the
> ancestor cfs_rq's pelt clock throttled status but only the first level
> cfs_rq's, because the purpose is to have the first level cfs_rq to stay
> on the leaf list and all those list_add_leaf_cfs_rq() for its ancestors
> are just to make sure the list is fully connected. I mean something like
> this:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 75c615f5ed640..6a6d9200ab93c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13170,6 +13170,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
>  static void propagate_entity_cfs_rq(struct sched_entity *se)
>  {
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +	bool add = !cfs_rq_pelt_clock_throttled(cfs_rq);

I actually think what it is doing now is already correct. The part of
the hierarchy between the root and the first throttled cfs_rq gets the
load via propagation, and this load needs to be decayed, otherwise the
CPU might appear busy.

Consider: root (decayed) -> A (decayed) -> B (*throttled)

If a heavy task is migrated to B, the load is propagated to B, A, and
then to root. If we don't queue them on leaf_cfs_rq_list until
distribution, the root cfs_rq will appear to have some load and the load
balancer may get confused in find_busiest_group().

Now, if we place A and the root cfs_rq back on leaf_cfs_rq_list, each
ILB stats kick will decay the load; eventually the CPU will appear idle,
and post unthrottle the load will start accumulating again.

>  
>  	/*
>  	 * If a task gets attached to this cfs_rq and before being queued,
> @@ -13177,7 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>  	 * change, make sure this cfs_rq stays on leaf cfs_rq list to have
>  	 * that removed load decayed or it can cause faireness problem.
>  	 */
> -	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> +	if (add)
>  		list_add_leaf_cfs_rq(cfs_rq);
>  
>  	/* Start to propagate at parent */
> @@ -13188,7 +13189,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>  
>  		update_load_avg(cfs_rq, se, UPDATE_TG);
>  
> -		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> +		if (add)
>  			list_add_leaf_cfs_rq(cfs_rq);
>  	}
> 
> But this is a different thing and can be taken care of later if
> necessary. The current logic doesn't have a problem; it's just not as
> clear to me as the diff above.

Ack! Ben and Chengming have a better idea of these bits; perhaps they
can shed some light on the intended purpose and what the right thing to
do is.

-- 
Thanks and Regards,
Prateek

