Message-ID: <14be66aa-e088-4267-ac10-d04d600b1294@amd.com>
Date: Wed, 3 Sep 2025 22:42:01 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>, Aaron Lu <ziqianlu@...edance.com>
CC: Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
	Chengming Zhou <chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>,
	Ingo Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
	Xi Wang <xii@...gle.com>, <linux-kernel@...r.kernel.org>, Juri Lelli
	<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, Steven
 Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Chuyi Zhou
	<zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>, Florian
 Bezdeka <florian.bezdeka@...mens.com>, Songtang Liu
	<liusongtang@...edance.com>, Chen Yu <yu.c.chen@...el.com>, Matteo Martelli
	<matteo.martelli@...ethink.co.uk>, Michal Koutný <mkoutny@...e.com>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v4 3/5] sched/fair: Switch to task based throttle model

Hello Peter,

On 9/3/2025 8:21 PM, Peter Zijlstra wrote:
>>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>  {
>> +	if (task_is_throttled(p)) {
>> +		dequeue_throttled_task(p, flags);
>> +		return true;
>> +	}
>> +
>>  	if (!p->se.sched_delayed)
>>  		util_est_dequeue(&rq->cfs, p);
>>  
> 
> OK, so this makes it so that either a task is fully enqueued (all
> cfs_rq's) or full not. A group cfs_rq is only marked throttled when all
> its tasks are gone, and unthrottled when a task gets added. Right?

The cfs_rq (and the hierarchy below it) is marked throttled when the
quota is exhausted. Tasks on the throttled hierarchy then dequeue
themselves completely via task work added during pick. When the last
task leaves a cfs_rq of the throttled hierarchy, PELT is frozen for
that cfs_rq.

When a new task is added to the hierarchy, PELT is unfrozen and the
task becomes runnable. The cfs_rq and the hierarchy remain marked
throttled.

Unthrottling of the hierarchy is only done at distribution.
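
To put that flow into (very rough, illustrative) pseudocode - the
labels below are only stand-ins, not the exact functions in the
series:

	/* quota exhausted: mark the hierarchy, nothing is dequeued yet */
	throttle:
		mark cfs_rq and the hierarchy below it throttled;

	/* pick: a task on a throttled hierarchy arms task work */
	pick:
		if (task sits on a throttled hierarchy)
			add task work to dequeue it;

	/* task work: the task dequeues itself completely */
	task work:
		dequeue the task from the hierarchy;
		if (it was the last task on that cfs_rq)
			freeze the PELT clock of that cfs_rq;

	/* enqueue on a still-throttled hierarchy */
	enqueue:
		unfreeze the PELT clock;
		/* cfs_rq / hierarchy stays marked throttled */

	/* unthrottle only happens from distribution */
	distribution:
		unthrottle the hierarchy;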

> 
> But propagate_entity_cfs_rq() is still doing the old thing, and has a
> if (cfs_rq_throttled(cfs_rq)) break; inside the for_each_sched_entity()
> iteration.
> 
> This seems somewhat inconsistent; or am I missing something ? 

Probably an oversight. But before that, what was the reason for
stopping this propagation at a throttled cfs_rq before these changes?

Looking at commit 09a43ace1f98 ("sched/fair: Propagate load during
synchronous attach/detach"), was it because the update_load_avg() calls
from the enqueue_entity() loop previously in unthrottle_cfs_rq() would
have propagated the PELT signals to the root on unthrottle?

If that was the intention, perhaps something like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e..4b32785231bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5234,6 +5234,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq);
 
 static void
 requeue_delayed_entity(struct sched_entity *se);
@@ -5729,6 +5730,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttled;
 }
 
+static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled;
+}
+
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
@@ -6721,6 +6727,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return 0;
 }
 
+static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
 	return 0;
@@ -13151,7 +13162,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-	if (cfs_rq_throttled(cfs_rq))
+	if (cfs_rq_pelt_clock_throttled(cfs_rq))
 		return;
 
 	if (!throttled_hierarchy(cfs_rq))
@@ -13165,7 +13176,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 
-		if (cfs_rq_throttled(cfs_rq))
+		if (cfs_rq_pelt_clock_throttled(cfs_rq))
 			break;
 
 		if (!throttled_hierarchy(cfs_rq))
---

The PELT signals will then be propagated to the root either via the
update_load_avg() call when the PELT clock unfreezes, or through the
__update_blocked_fair() path after unthrottle.
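
That is, roughly (an illustrative sketch of the two paths, not the
exact call chains):

	/* path 1: a task is enqueued while the hierarchy is throttled */
	enqueue -> PELT clock unfreezes
		-> update_load_avg(cfs_rq, se, UPDATE_TG) propagates the
		   pending deltas up towards the root;

	/* path 2: nothing is enqueued before unthrottle */
	unthrottle -> cfs_rq is left with blocked load
		-> __update_blocked_fair() decays and propagates it from
		   the load balancing path;

Thoughts?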

-- 
Thanks and Regards,
Prateek

