Message-Id: <20260121063402.GA1837728@bytedance.com>
Date: Wed, 21 Jan 2026 14:34:02 +0800
From: "Aaron Lu" <ziqianlu@...edance.com>
To: "K Prateek Nayak" <kprateek.nayak@....com>
Cc: "Zicheng Qu" <quzicheng@...wei.com>, <bsegall@...gle.com>,
<dhaval@...ux.vnet.ibm.com>, <dietmar.eggemann@....com>,
<juri.lelli@...hat.com>, <linux-kernel@...r.kernel.org>,
<mgorman@...e.de>, <mingo@...hat.com>, <peterz@...radead.org>,
<rostedt@...dmis.org>, <tanghui20@...wei.com>,
<vatsa@...ux.vnet.ibm.com>, <vincent.guittot@...aro.org>,
<vschneid@...hat.com>, <zhangqiao22@...wei.com>
Subject: Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
On Wed, Jan 21, 2026 at 10:54:11AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 1/21/2026 9:19 AM, Aaron Lu wrote:
> > On Tue, Jan 20, 2026 at 03:25:49AM +0000, Zicheng Qu wrote:
> >> Consider the following sequence on a CPU configured with nohz_full:
> >>
> >> 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
> >> bandwidth control. The gse (for cgroup A) to which task P is attached
> >> is dequeued and the CPU switches to idle.
> >>
> >> 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
> >> another cgroup B (not throttled).
> >>
> >> During sched_move_task(), the task P is observed as queued but not
> >> running, and therefore no resched_curr() is triggered.
> >>
> >> 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
> >> explicit scheduling event, i.e., resched_curr().
> >>
> >> 4) Later, cgroup A is unthrottled. However, task P has already been
> >> migrated out of cgroup A, so unthrottle_cfs_rq() may observe
> >> cfs_rq->load.weight == 0 and return early without calling resched_curr().
> >
> > I suppose this is only possible when the unthrottled cfs_rq has been
> > fully decayed, i.e. !cfs_rq->on_list is true?
>
> Ack! Since we detach the task from the cfs_rq during
> task_change_group_fair(), the cfs_rq_is_decayed() check during
> tg_unthrottle_up() can return true, so we skip putting the cfs_rq on
> the leaf_cfs_rq_list and unthrottle_cfs_rq() skips the resched.
>
> > Because only in that case
> > will it skip the resched_curr() at the bottom of unthrottle_cfs_rq() for
> > the scenario you have described.
>
> Indeed. Happy coincidence that we checked for "rq->cfs.nr_queued" and an
> unrelated unthrottle could still force a resched for a missed one :-)
>
Yes.

For cfs_rqs with no tasks enqueued during unthrottle, we probably should
just return, regardless of whether the cfs_rq is fully decayed. That
could avoid some unnecessary rescheds.
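
To make that concrete, a minimal sketch of the idea (untested,
illustration only; it assumes the check sits right after the
walk_tg_tree_from() call and leaves aside the leaf_cfs_rq_list
bookkeeping that the existing "Complete the branch" block does for
on_list cfs_rqs):

	/* Hypothetical early exit in unthrottle_cfs_rq(): */
	if (!cfs_rq->nr_queued) {
		/*
		 * Nothing got queued on this hierarchy while it was
		 * throttled, decayed or not, so there is nothing to
		 * enqueue and no reason to reach the resched_curr()
		 * check at the bottom.
		 */
		return;
	}
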
> >
> > Looking at this logic, I feel the early return due to
> > (!cfs_rq->load.weight) && (!cfs_rq->on_list) is strange, because the
> > resched at the bottom:
> >
> > /* Determine whether we need to wake up potentially idle CPU: */
> > if (rq->curr == rq->idle && rq->cfs.nr_queued)
> > resched_curr(rq);
> >
> > should not depend on whether cfs_rq is fully decayed or not...
>
> But if it is off the list, then it doesn't have any tasks to resched
> anyway, and in Zicheng's scenario too, the cfs_rq won't have any
> tasks at the time of unthrottle.
>
Right.

What I wanted to say is: whether to resched should depend on whether
tasks have been woken on this unthrottled hierarchy, but the current
logic only skips the resched for a fully decayed cfs_rq; for a cfs_rq
with no tasks queued but still on_list, the resched condition check is
still attempted. That is what made me feel strange.

I think the current behavior was introduced by commit 2630cde26711
("sched/fair: Add ancestors of unthrottled undecayed cfs_rq"). The
original behavior was to simply return if no tasks were queued, per
commit 671fd9dabe52 ("sched: Add support for unthrottling group
entities"). Although looking at commit 671fd9dabe52, I don't see why
rq->cfs.nr_running is used there instead of the cfs_rq's own
nr_running.
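
Just to spell out the distinction (field names as in today's fair.c,
where the nr_running of that era corresponds to nr_queued; the two
checks below are shown side by side purely for comparison, not as a
patch):

	/*
	 * rq->cfs is the CPU's root cfs_rq, so rq->cfs.nr_queued being
	 * non-zero only says the CPU has something runnable in the fair
	 * class, possibly from a hierarchy unrelated to the one being
	 * unthrottled.
	 */
	if (rq->curr == rq->idle && rq->cfs.nr_queued)
		resched_curr(rq);

	/* vs. a check tied to the hierarchy that was just unthrottled: */
	if (rq->curr == rq->idle && cfs_rq->nr_queued)
		resched_curr(rq);
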
> >
> > I think it should be something like this:
> > - complete the branch if no task enqueued but still on_list;
> > - only resched_curr() if task gets enqueued
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e71302282671c..e09da54a5d117 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6009,9 +6009,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > /* update hierarchical throttle state */
> > walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
> >
> > - if (!cfs_rq->load.weight) {
> > - if (!cfs_rq->on_list)
> > - return;
> > + if (!cfs_rq->load.weight && cfs_rq->on_list) {
> > /*
> > * Nothing to run but something to decay (on_list)?
> > * Complete the branch.
> > @@ -6025,7 +6023,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > assert_list_leaf_cfs_rq(rq);
> >
> > /* Determine whether we need to wake up potentially idle CPU: */
> > - if (rq->curr == rq->idle && rq->cfs.nr_queued)
> > + if (rq->curr == rq->idle && cfs_rq->nr_queued)
> > resched_curr(rq);
> > }
> >
> >
> > Thoughts?
>
> Yes, checking for cfs_rq->nr_queued should indicate whether new tasks
> were woken on this unthrottled hierarchy. It should make it easier to
> spot scenarios like the one Zicheng experienced, if there are any more
> of those lurking around.
Yes indeed :)