Message-ID: <20250926093801.GE120@bytedance>
Date: Fri, 26 Sep 2025 17:38:01 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Matteo Martelli <matteo.martelli@...ethink.co.uk>,
linux-kernel@...r.kernel.org,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with
PELT clock throttled
On Fri, Sep 26, 2025 at 08:19:17AM +0000, K Prateek Nayak wrote:
> Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
> enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
> load for throttled cfs_rq") which transitioned to using
> cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
> propagate_entity_cfs_rq().
>
> The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
> hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
> its descendants should have their PELT frozen too or weird things can
> happen as a result of children accumulating PELT signals when the
> parents have their PELT clock stopped.
>
> Another side effect of this is the loss of integrity of the leaf cfs_rq
> list. As debugged by Aaron, consider the following hierarchy:
>
>         root(#)
>        /    \
>     A(#)    B(*)
>              |
>              C <--- new cgroup
>              |
>              D <--- new cgroup
>
> # - Already on leaf cfs_rq list
> * - Throttled with PELT frozen
>
> The newly created cgroups don't have their "pelt_clock_throttled" signal
> synced with cgroup B. Next, the following series of events occur:
>
> 1. online_fair_sched_group() for cgroup D will call
>    propagate_entity_cfs_rq(). (Same can happen if a throttled task is
>    moved to cgroup C and enqueue_task_fair() returns early.)
>
>    propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
>    "rq->tmp_alone_branch" since its PELT clock is not marked throttled
>    and cfs_rq of cgroup B is not on the list.
>
>    cfs_rq of cgroup B is skipped since its PELT is throttled.
>
>    root cfs_rq already exists on the leaf cfs_rq list, leading to
>    list_add_leaf_cfs_rq() returning early.
>
>    The cfs_rq of cgroup C is left dangling on the
>    "rq->tmp_alone_branch".
>
> 2. A new task wakes up on cgroup A. Since the whole hierarchy is already
>    on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
>    without any modifications to "rq->tmp_alone_branch".
>
>    The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
>    dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
>
>    !!! Splat !!!
>
> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
> dequeue on other subtree on the throttled hierarchy can freeze the PELT
> clock for the parent hierarchy without setting the indicators for this
> newly added cfs_rq which was never enqueued.
>
Sigh...
> Since there are no tasks on the new hierarchy, start a cfs_rq on a
> throttled hierarchy with its PELT clock throttled. The first enqueue, or
> the distribution (whichever happens first) will unfreeze the PELT clock
> and queue the cfs_rq on the leaf cfs_rq list.
>
Makes sense.
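
For illustration only, the sequence above and the proposed fix can be
modelled with a small userspace sketch. Everything here is a toy
stand-in under stated assumptions: the toy_* types and functions are
hypothetical, a NULL tmp_alone_branch stands for "points at
rq->leaf_cfs_rq_list", and the real RCU list handling is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration; NOT the kernel structures. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;	/* NULL for the root cfs_rq */
	bool on_list;			/* already on the leaf cfs_rq list? */
	bool pelt_clock_throttled;	/* PELT clock frozen? */
};

struct toy_rq {
	/* NULL models "points at the list head", i.e. nothing dangling. */
	struct toy_cfs_rq *tmp_alone_branch;
};

/* Simplified model of the add path: only tracks branch connectivity. */
static bool toy_list_add_leaf(struct toy_rq *rq, struct toy_cfs_rq *c)
{
	assert(rq && c);

	/* Already listed: return early, tmp_alone_branch is left untouched. */
	if (c->on_list)
		return rq->tmp_alone_branch == NULL;

	c->on_list = true;

	/* Root, or parent already listed: the branch is connected. */
	if (!c->parent || c->parent->on_list) {
		rq->tmp_alone_branch = NULL;
		return true;
	}

	/* Parent not listed yet: the branch stays open until it is. */
	rq->tmp_alone_branch = c;
	return false;
}

/* Walk from a cfs_rq up to the root, checking the flag at every level. */
static void toy_propagate(struct toy_rq *rq, struct toy_cfs_rq *c)
{
	for (; c; c = c->parent)
		if (!c->pelt_clock_throttled)
			toy_list_add_leaf(rq, c);
}

/*
 * Model onlining cgroup D below root -> B(throttled) -> C(new).
 * Returns true if tmp_alone_branch is left dangling afterwards.
 */
static bool toy_online_leaves_dangling(bool new_starts_throttled)
{
	struct toy_cfs_rq root = { NULL, true, false };
	struct toy_cfs_rq b = { &root, false, true };
	struct toy_cfs_rq c = { &b, false, new_starts_throttled };
	struct toy_rq rq = { NULL };

	toy_propagate(&rq, &c);	/* propagation for D starts at C's cfs_rq */
	return rq.tmp_alone_branch != NULL;
}
```

In this model, toy_online_leaves_dangling(false) corresponds to the
reported case (C is added while B is skipped, so the branch is never
closed), and toy_online_leaves_dangling(true) corresponds to starting
the new cfs_rq with its PELT clock throttled, which leaves
tmp_alone_branch untouched.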
> While at it, add an assert_list_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to catch such cases in the future.
>
> Suggested-by: Aaron Lu <ziqianlu@...edance.com>
> Reported-by: Matteo Martelli <matteo.martelli@...ethink.co.uk>
> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
Should be Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle
model")? "Task based throttle time accounting" doesn't touch the PELT bits.
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
Reviewed-by: Aaron Lu <ziqianlu@...edance.com>
Tested-by: Aaron Lu <ziqianlu@...edance.com>

Thanks for the fix.

BTW, I think propagate_entity_cfs_rq() shouldn't check each ancestor
cfs_rq's PELT clock throttled status, only the first-level cfs_rq's.
The purpose is to keep the first-level cfs_rq on the leaf list; the
list_add_leaf_cfs_rq() calls for its ancestors are just there to make
sure the list is fully connected. I mean something like this:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75c615f5ed640..6a6d9200ab93c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13170,6 +13170,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 static void propagate_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	bool add = !cfs_rq_pelt_clock_throttled(cfs_rq);

 	/*
 	 * If a task gets attached to this cfs_rq and before being queued,
@@ -13177,7 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 	 * change, make sure this cfs_rq stays on leaf cfs_rq list to have
 	 * that removed load decayed or it can cause faireness problem.
 	 */
-	if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+	if (add)
 		list_add_leaf_cfs_rq(cfs_rq);

 	/* Start to propagate at parent */
@@ -13188,7 +13189,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)

 		update_load_avg(cfs_rq, se, UPDATE_TG);

-		if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+		if (add)
 			list_add_leaf_cfs_rq(cfs_rq);
 	}

But this is a separate matter and can be taken care of later if
necessary. The current logic doesn't have a problem; it's just not as
clear to me as the above diff.