Message-ID: <20250904081611.GE42@bytedance>
Date: Thu, 4 Sep 2025 16:16:11 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Benjamin Segall <bsegall@...gle.com>
Cc: K Prateek Nayak <kprateek.nayak@....com>,
Peter Zijlstra <peterz@...radead.org>,
Valentin Schneider <vschneid@...hat.com>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>,
Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>,
Songtang Liu <liusongtang@...edance.com>,
Chen Yu <yu.c.chen@...el.com>,
Matteo Martelli <matteo.martelli@...ethink.co.uk>,
Michal Koutný <mkoutny@...e.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v4 3/5] sched/fair: Switch to task based throttle model
On Wed, Sep 03, 2025 at 01:46:48PM -0700, Benjamin Segall wrote:
> K Prateek Nayak <kprateek.nayak@....com> writes:
>
> > Hello Peter,
> >
> > On 9/3/2025 8:21 PM, Peter Zijlstra wrote:
> >>>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>>  {
> >>> +	if (task_is_throttled(p)) {
> >>> +		dequeue_throttled_task(p, flags);
> >>> +		return true;
> >>> +	}
> >>> +
> >>>  	if (!p->se.sched_delayed)
> >>>  		util_est_dequeue(&rq->cfs, p);
> >>>
> >>
> >> OK, so this makes it so that either a task is fully enqueued (on all
> >> cfs_rq's) or fully not. A group cfs_rq is only marked throttled when all
> >> its tasks are gone, and unthrottled when a task gets added. Right?
> >
> > A cfs_rq (and the hierarchy below it) is marked throttled when the quota
> > is exhausted. Tasks on the throttled hierarchies dequeue themselves
> > completely via task work added during pick. When the last task leaves a
> > cfs_rq of the throttled hierarchy, PELT is frozen for that cfs_rq.
> >
> > When a new task is added to the hierarchy, PELT is unfrozen and the task
> > becomes runnable. The cfs_rq and the hierarchy are still marked
> > throttled.
> >
> > Unthrottling of the hierarchy is only done at distribution time.
> >
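To make the flow described above concrete, this is roughly how I picture
it (pseudo code only; helper names like setup_throttle_task_work(),
freeze_pelt_clock()/unfreeze_pelt_clock() and the limbo list field are
simplified placeholders, not necessarily what the patches use):

	/* pick time: a task on a throttled hierarchy gets task work queued */
	if (throttled_hierarchy(cfs_rq_of(&p->se)))
		setup_throttle_task_work(p);

	/* the task work runs before the task returns to user space */
	dequeue_task_fair(rq, p, DEQUEUE_SLEEP);	/* leaves all cfs_rqs */
	list_add(&p->throttle_node, &cfs_rq->limbo_list);
	if (!cfs_rq->h_nr_queued)
		freeze_pelt_clock(cfs_rq);		/* last task is gone */

	/* enqueue of a new task on a still throttled hierarchy */
	unfreeze_pelt_clock(cfs_rq);	/* task runs, cfs_rq stays marked throttled */

	/* the real unthrottle only happens at quota distribution */
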
> >>
> >> But propagate_entity_cfs_rq() is still doing the old thing, and has an
> >> if (cfs_rq_throttled(cfs_rq)) break; inside the for_each_sched_entity()
> >> iteration.
> >>
> >> This seems somewhat inconsistent; or am I missing something ?
> >
> > Probably an oversight. But before that, what was the reason for stopping
> > this propagation at the cfs_rq_throttled() check before the changes?
> >
>
> Yeah, this was one of the things I was (slowly) looking at - with this
> series we currently still abort in:
>
> 1) update_cfs_group
> 2) dequeue_entities's set_next_buddy
> 3) check_preempt_wakeup_fair
> 4) yield_to
> 5) propagate_entity_cfs_rq
>
> In the old design, a throttle immediately removes the entire cfs_rq,
> freezes time for it, and stops adjusting load. In the new design we still
> pick from it, so we definitely don't want to stop time (and don't). I'm
> guessing we probably also want to adjust load for it now, but it is
> arguable: all the cfs_rqs for the tg are likely to throttle at about the
> same time, so we might not want to mess with the shares distribution,
> since when unthrottle comes around the most likely correct distribution
> is the one we had at the time of throttle.
>
I can run my previous performance tests to see how things change when we
also adjust load and share distribution for throttled hierarchies.
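For context on the shares side: if I read the big comment above
calc_group_shares() right, it is built around roughly this approximation
(sketch only, approx_group_shares() below is not real kernel code):

	/*
	 * ge->load.weight ~= tg->shares * grq->load.weight / tg->load_avg
	 *
	 * If every cfs_rq of the tg stops changing load at throttle time,
	 * this ratio, and hence the per-cpu shares split, stays close to
	 * what it was when the group got throttled.
	 */
	static long approx_group_shares(long tg_shares, long grq_load, long tg_load)
	{
		return tg_load ? tg_shares * grq_load / tg_load : tg_shares;
	}

So I can see the argument for leaving the distribution alone across a
throttle; the tests should tell us whether it matters in practice.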
> Assuming we do want to adjust load for a throttle, we probably want to
> remove the aborts from update_cfs_group and propagate_entity_cfs_rq.
> I'm guessing that we need the list_add_leaf_cfs_rq from propagate, but
> I'm not 100% sure when they are actually doing something in propagate as
> opposed to enqueue.
>
Yes, commit 0258bdfaff5bd ("sched/fair: Fix unfairness caused by missing
load decay") added that list_add_leaf_cfs_rq() in
propagate_entity_cfs_rq() to fix a missing load decay problem.
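For anyone following along: the leaf list matters because blocked load is
only decayed for cfs_rqs that are on it. __update_blocked_fair() walks the
list roughly along these lines (heavily simplified from the actual code):

	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
		/* decay blocked load and propagate it to the task group */
		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
			update_tg_load_avg(cfs_rq);

		/* fully decayed cfs_rqs are dropped from the list */
		if (cfs_rq_is_decayed(cfs_rq))
			list_del_leaf_cfs_rq(cfs_rq);
	}

so a cfs_rq that is left off the list keeps its stale load until something
re-adds it, which is the unfairness that commit fixed.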
> The other 3 are the same sort of thing - scheduling pick heuristics
> which imo are pretty arbitrary to keep. We can reasonably say that "the
> most likely thing a task in a throttled hierarchy will do is just go
> throttle itself, so we shouldn't buddy it or let it preempt", but it
> would also be reasonable to let such tasks preempt/buddy normally, in
> case they hold locks or such.
I think we do not need to special-case tasks in a throttled hierarchy in
check_preempt_wakeup_fair().
>
> yield_to is used by kvm and st-dma-fence-chain.c. Yielding to a
> throttle-on-exit kvm cpu thread isn't useful (so no need to remove the
> abort there). The dma code is just yielding to a just-spawned kthread,
> so it should be fine either way.
Got it.
The cumulative diff I'm going to experiment with is below; let me know if
something is wrong, thanks.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e927b9b7eeb6..c2e46b8e5e3d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3957,9 +3957,6 @@ static void update_cfs_group(struct sched_entity *se)
 	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
-	if (throttled_hierarchy(gcfs_rq))
-		return;
-
 	shares = calc_group_shares(gcfs_rq);
 	if (unlikely(se->load.weight != shares))
 		reweight_entity(cfs_rq_of(se), se, shares);
@@ -5234,6 +5231,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq);
 
 static void
 requeue_delayed_entity(struct sched_entity *se);
@@ -5729,6 +5727,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttled;
 }
 
+static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled;
+}
+
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
@@ -6721,6 +6724,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return 0;
 }
 
+static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
 	return 0;
@@ -7074,7 +7082,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 			 * Bias pick_next to pick a task from this cfs_rq, as
 			 * p is sleeping when it is within its sched_slice.
 			 */
-			if (task_sleep && se && !throttled_hierarchy(cfs_rq))
+			if (task_sleep && se)
 				set_next_buddy(se);
 			break;
 		}
@@ -8722,15 +8730,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	if (unlikely(se == pse))
 		return;
 
-	/*
-	 * This is possible from callers such as attach_tasks(), in which we
-	 * unconditionally wakeup_preempt() after an enqueue (which may have
-	 * lead to a throttle). This both saves work and prevents false
-	 * next-buddy nomination below.
-	 */
-	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
-		return;
-
 	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
 		set_next_buddy(pse);
 	}
@@ -13154,10 +13153,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-	if (cfs_rq_throttled(cfs_rq))
-		return;
-
-	if (!throttled_hierarchy(cfs_rq))
+	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
 		list_add_leaf_cfs_rq(cfs_rq);
 
 	/* Start to propagate at parent */
@@ -13168,10 +13164,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-
-		if (!throttled_hierarchy(cfs_rq))
+		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
 			list_add_leaf_cfs_rq(cfs_rq);
 	}
 }
--
2.39.5