Message-ID: <20250904081611.GE42@bytedance>
Date: Thu, 4 Sep 2025 16:16:11 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Benjamin Segall <bsegall@...gle.com>
Cc: K Prateek Nayak <kprateek.nayak@....com>,
Peter Zijlstra <peterz@...radead.org>,
Valentin Schneider <vschneid@...hat.com>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>,
Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>,
Songtang Liu <liusongtang@...edance.com>,
Chen Yu <yu.c.chen@...el.com>,
Matteo Martelli <matteo.martelli@...ethink.co.uk>,
Michal Koutný <mkoutny@...e.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v4 3/5] sched/fair: Switch to task based throttle model
On Wed, Sep 03, 2025 at 01:46:48PM -0700, Benjamin Segall wrote:
> K Prateek Nayak <kprateek.nayak@....com> writes:
>
> > Hello Peter,
> >
> > On 9/3/2025 8:21 PM, Peter Zijlstra wrote:
> >>>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>>  {
> >>> +	if (task_is_throttled(p)) {
> >>> +		dequeue_throttled_task(p, flags);
> >>> +		return true;
> >>> +	}
> >>> +
> >>>  	if (!p->se.sched_delayed)
> >>>  		util_est_dequeue(&rq->cfs, p);
> >>>
> >>
> >> OK, so this makes it so that either a task is fully enqueued (on all
> >> cfs_rq's) or fully not. A group cfs_rq is only marked throttled when all
> >> its tasks are gone, and unthrottled when a task gets added. Right?
> >
> > A cfs_rq (and the hierarchy below it) is marked throttled when the quota
> > is exhausted. Tasks on the throttled hierarchies dequeue themselves
> > completely via task work added during pick. When the last task leaves a
> > cfs_rq of the throttled hierarchy, PELT is frozen for that cfs_rq.
> >
> > When a new task is added to the hierarchy, PELT is unfrozen and the task
> > becomes runnable. The cfs_rq and the hierarchy are still marked
> > throttled.
> >
> > Unthrottling of the hierarchy is only done at distribution time.
> >
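To make the flow described above concrete, this is roughly how I picture
it (pseudo code only; helper names like setup_throttle_task_work(),
freeze_pelt_clock()/unfreeze_pelt_clock() and the limbo list field are
simplified placeholders, not necessarily what the patches use):

	/* pick time: a task on a throttled hierarchy gets task work queued */
	if (throttled_hierarchy(cfs_rq_of(&p->se)))
		setup_throttle_task_work(p);

	/* the task work runs before the task returns to user space */
	dequeue_task_fair(rq, p, DEQUEUE_SLEEP);	/* leaves all cfs_rqs */
	list_add(&p->throttle_node, &cfs_rq->limbo_list);
	if (!cfs_rq->h_nr_queued)
		freeze_pelt_clock(cfs_rq);		/* last task is gone */

	/* enqueue of a new task on a still throttled hierarchy */
	unfreeze_pelt_clock(cfs_rq);	/* task runs, cfs_rq stays marked throttled */

	/* the real unthrottle only happens at quota distribution */
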
> >>
> >> But propagate_entity_cfs_rq() is still doing the old thing, and has an
> >> if (cfs_rq_throttled(cfs_rq)) break; inside the for_each_sched_entity()
> >> iteration.
> >>
> >> This seems somewhat inconsistent; or am I missing something ?
> >
> > Probably an oversight. But before that, what was the reason for stopping
> > this propagation at the cfs_rq_throttled() check before the changes?
> >
>
> Yeah, this was one of the things I was (slowly) looking at - with this
> series we currently still abort in:
>
> 1) update_cfs_group
> 2) dequeue_entities's set_next_buddy
> 3) check_preempt_wakeup_fair
> 4) yield_to
> 5) propagate_entity_cfs_rq
>
> In the old design, a throttle immediately removes the entire cfs_rq,
> freezes time for it, and stops adjusting load. In the new design we still
> pick from it, so we definitely don't want to stop time (and don't). I'm
> guessing we probably also want to adjust load for it now, but it is
> arguable: all the cfs_rqs for the tg are likely to throttle at about the
> same time, so we might not want to mess with the shares distribution,
> since when unthrottle comes around the most likely correct distribution
> is the one we had at the time of throttle.
>
I can run my previous performance tests to see how things change when we
also adjust load and share distribution for throttled hierarchies.
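For context on the shares side: if I read the big comment above
calc_group_shares() right, it is built around roughly this approximation
(sketch only, approx_group_shares() below is not real kernel code):

	/*
	 * ge->load.weight ~= tg->shares * grq->load.weight / tg->load_avg
	 *
	 * If every cfs_rq of the tg stops changing load at throttle time,
	 * this ratio, and hence the per-cpu shares split, stays close to
	 * what it was when the group got throttled.
	 */
	static long approx_group_shares(long tg_shares, long grq_load, long tg_load)
	{
		return tg_load ? tg_shares * grq_load / tg_load : tg_shares;
	}

So I can see the argument for leaving the distribution alone across a
throttle; the tests should tell us whether it matters in practice.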
> Assuming we do want to adjust load for a throttle, we probably want to
> remove the aborts from update_cfs_group and propagate_entity_cfs_rq.
> I'm guessing that we need the list_add_leaf_cfs_rq from propagate, but
> I'm not 100% sure when they are actually doing something in propagate as
> opposed to enqueue.
>
Yes, commit 0258bdfaff5bd ("sched/fair: Fix unfairness caused by missing
load decay") added that list_add_leaf_cfs_rq() in
propagate_entity_cfs_rq() to fix a missing load decay problem.
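For anyone following along: the leaf list matters because blocked load is
only decayed for cfs_rqs that are on it. __update_blocked_fair() walks the
list roughly along these lines (heavily simplified from the actual code):

	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
		/* decay blocked load and propagate it to the task group */
		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
			update_tg_load_avg(cfs_rq);

		/* fully decayed cfs_rqs are dropped from the list */
		if (cfs_rq_is_decayed(cfs_rq))
			list_del_leaf_cfs_rq(cfs_rq);
	}

so a cfs_rq that is left off the list keeps its stale load until something
re-adds it, which is the unfairness that commit fixed.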
> The other 3 are the same sort of thing - scheduling pick heuristics
> which imo are pretty arbitrary to keep. We can reasonably say that "the
> most likely thing a task in a throttled hierarchy will do is just go
> throttle itself, so we shouldn't buddy it or let it preempt", but it
> would also be reasonable to let such tasks preempt/buddy normally, in
> case they hold locks or such.
I think we do not need to special-case tasks in a throttled hierarchy in
check_preempt_wakeup_fair().
>
> yield_to is used by kvm and st-dma-fence-chain.c. Yielding to a
> throttle-on-exit kvm cpu thread isn't useful (so no need to remove the
> abort there). The dma code is just yielding to a just-spawned kthread,
> so it should be fine either way.
Got it.
The cumulative diff I'm going to experiment with is below; let me know if
something is wrong, thanks.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e927b9b7eeb6..c2e46b8e5e3d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3957,9 +3957,6 @@ static void update_cfs_group(struct sched_entity *se)
 	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
-	if (throttled_hierarchy(gcfs_rq))
-		return;
-
 	shares = calc_group_shares(gcfs_rq);
 	if (unlikely(se->load.weight != shares))
 		reweight_entity(cfs_rq_of(se), se, shares);
@@ -5234,6 +5231,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq);
 
 static void
 requeue_delayed_entity(struct sched_entity *se);
@@ -5729,6 +5727,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttled;
 }
 
+static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled;
+}
+
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
@@ -6721,6 +6724,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return 0;
 }
 
+static inline int cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
 	return 0;
@@ -7074,7 +7082,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 			 * Bias pick_next to pick a task from this cfs_rq, as
 			 * p is sleeping when it is within its sched_slice.
 			 */
-			if (task_sleep && se && !throttled_hierarchy(cfs_rq))
+			if (task_sleep && se)
 				set_next_buddy(se);
 			break;
 		}
@@ -8722,15 +8730,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	if (unlikely(se == pse))
 		return;
 
-	/*
-	 * This is possible from callers such as attach_tasks(), in which we
-	 * unconditionally wakeup_preempt() after an enqueue (which may have
-	 * lead to a throttle). This both saves work and prevents false
-	 * next-buddy nomination below.
-	 */
-	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
-		return;
-
 	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
 		set_next_buddy(pse);
 	}
@@ -13154,10 +13153,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-	if (cfs_rq_throttled(cfs_rq))
-		return;
-
-	if (!throttled_hierarchy(cfs_rq))
+	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
 		list_add_leaf_cfs_rq(cfs_rq);
 
 	/* Start to propagate at parent */
@@ -13168,10 +13164,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-
-		if (!throttled_hierarchy(cfs_rq))
+		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
 			list_add_leaf_cfs_rq(cfs_rq);
 	}
 }
--
2.39.5