linux-kernel - Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups

Message-Id: <20260121034918.GA1303836@bytedance.com>
Date: Wed, 21 Jan 2026 11:49:18 +0800
From: "Aaron Lu" <ziqianlu@...edance.com>
To: "Zicheng Qu" <quzicheng@...wei.com>
Cc: <kprateek.nayak@....com>, <bsegall@...gle.com>, 
	<dhaval@...ux.vnet.ibm.com>, <dietmar.eggemann@....com>, 
	<juri.lelli@...hat.com>, <linux-kernel@...r.kernel.org>, 
	<mgorman@...e.de>, <mingo@...hat.com>, <peterz@...radead.org>, 
	<rostedt@...dmis.org>, <tanghui20@...wei.com>, 
	<vatsa@...ux.vnet.ibm.com>, <vincent.guittot@...aro.org>, 
	<vschneid@...hat.com>, <zhangqiao22@...wei.com>
Subject: Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups

On Tue, Jan 20, 2026 at 03:25:49AM +0000, Zicheng Qu wrote:
> Consider the following sequence on a CPU configured with nohz_full:
> 
> 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
>    bandwidth control. The gse (cgroup A) to which the task P is attached
>    is dequeued and the CPU switches to idle.
> 
> 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
>    another cgroup B (not throttled).
> 
>    During sched_move_task(), the task P is observed as queued but not
>    running, and therefore no resched_curr() is triggered.
> 
> 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
>    explicit scheduling event, i.e., resched_curr().
> 
> 4) Later, cgroup A is unthrottled. However, the task P has already been
>    migrated out of cgroup A, so unthrottle_cfs_rq() may observe
>    load_weight == 0 and return early without calling resched_curr().

I suppose this is only possible when the unthrottled cfs_rq has been
fully decayed, i.e. !cfs_rq->on_list is true? Only in that case is the
resched_curr() at the bottom of unthrottle_cfs_rq() skipped for the
scenario you have described.

Looking at this logic, I feel the early return due to
(!cfs_rq->load.weight) && (!cfs_rq->on_list) is strange, because the
resched at the bottom:

	/* Determine whether we need to wake up potentially idle CPU: */
	if (rq->curr == rq->idle && rq->cfs.nr_queued)
		resched_curr(rq);

should not depend on whether cfs_rq is fully decayed or not...
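
To make the dependency concrete, here is a minimal userspace sketch of
only the tail of the current unthrottle_cfs_rq(). It is not kernel code:
the model_* structs, their fields and model_unthrottle_tail() are
hypothetical stand-ins, and need_resched merely records where
resched_curr(rq) would have been called.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct model_cfs_rq {
	unsigned long load_weight;	/* models cfs_rq->load.weight */
	bool on_list;			/* models cfs_rq->on_list */
};

struct model_rq {
	bool curr_is_idle;		/* models rq->curr == rq->idle */
	unsigned int cfs_nr_queued;	/* models rq->cfs.nr_queued */
	bool need_resched;		/* set where resched_curr(rq) would run */
};

/* Models only the tail of the current unthrottle_cfs_rq(). */
static void model_unthrottle_tail(struct model_rq *rq,
				  const struct model_cfs_rq *cfs_rq)
{
	if (!cfs_rq->load_weight) {
		if (!cfs_rq->on_list)
			return;	/* fully decayed: idle check below is skipped */
		/* on_list: the branch is completed, then the check below runs */
	}
	/* (re-enqueue of the unthrottled hierarchy elided in this model) */

	/* Determine whether we need to wake up potentially idle CPU: */
	if (rq->curr_is_idle && rq->cfs_nr_queued)
		rq->need_resched = true;
}
```

With an idle CPU and cfs_nr_queued == 1 (task P already queued under
cgroup B), an unthrottled cfs_rq with load_weight == 0 and on_list ==
false leaves need_resched unset in this model, while the same call with
on_list == true sets it.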

I think it should be something like this:
- complete the branch if no task is enqueued but the cfs_rq is still
  on_list;
- only call resched_curr() if a task gets enqueued.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e71302282671c..e09da54a5d117 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6009,9 +6009,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	/* update hierarchical throttle state */
 	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
 
-	if (!cfs_rq->load.weight) {
-		if (!cfs_rq->on_list)
-			return;
+	if (!cfs_rq->load.weight && cfs_rq->on_list) {
 		/*
 		 * Nothing to run but something to decay (on_list)?
 		 * Complete the branch.
@@ -6025,7 +6023,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	assert_list_leaf_cfs_rq(rq);
 
 	/* Determine whether we need to wake up potentially idle CPU: */
-	if (rq->curr == rq->idle && rq->cfs.nr_queued)
+	if (rq->curr == rq->idle && cfs_rq->nr_queued)
 		resched_curr(rq);
 }
 

Thoughts?
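
For reference, the decision logic that results from the two hunks above
can be sketched in userspace as follows. This is a reading of the diff
only, not tested kernel code: the sketch_* types, the sketch_path enum
and sketch_unthrottle_tail() are hypothetical names, and the enqueue
work itself is elided.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; only the fields the diff touches are modelled. */
struct sketch_cfs_rq {
	unsigned long load_weight;	/* models cfs_rq->load.weight */
	bool on_list;			/* models cfs_rq->on_list */
	unsigned int nr_queued;		/* models cfs_rq->nr_queued */
};

struct sketch_rq {
	bool curr_is_idle;		/* models rq->curr == rq->idle */
	bool need_resched;		/* set where resched_curr(rq) would run */
};

enum sketch_path {
	SKETCH_COMPLETE_BRANCH,		/* nothing to run, something to decay */
	SKETCH_ENQUEUE,			/* normal re-enqueue path */
};

/* Models the tail of unthrottle_cfs_rq() with the diff applied. */
static enum sketch_path sketch_unthrottle_tail(struct sketch_rq *rq,
					       const struct sketch_cfs_rq *cfs_rq)
{
	enum sketch_path path = SKETCH_ENQUEUE;

	/* The early return is gone; only the on_list case is special. */
	if (!cfs_rq->load_weight && cfs_rq->on_list)
		path = SKETCH_COMPLETE_BRANCH;

	/* The idle check now looks at this cfs_rq, not at rq->cfs. */
	if (rq->curr_is_idle && cfs_rq->nr_queued)
		rq->need_resched = true;

	return path;
}
```

In this sketch the wakeup decision depends only on whether the CPU is
idle and the unthrottled cfs_rq has queued tasks. It no longer depends
on on_list. Note that a cfs_rq whose only task has already been moved
away has nr_queued == 0, so this path alone does not request a
reschedule for it.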

> At this point, the task P is runnable in cgroup B (not throttled), but
> the CPU remains in do_idle() with no pending reschedule point. The
> system stays in this state until an unrelated event that triggers
> resched_curr() (e.g. a new task wakeup) breaks the nohz_full idle
> state, and only then does the task P finally get scheduled.
> 
> The root cause is that sched_move_task() may classify the task as only
> queued, not running, and therefore fails to trigger a resched_curr(),
> while the later unthrottling path no longer has visibility of the
> migrated task.
> 
> Preserve the existing behavior for running tasks by issuing
> resched_curr(), and explicitly invoke check_preempt_curr() for tasks
> that were queued at the time of migration. This ensures that runnable
> tasks are reconsidered for scheduling even when nohz_full suppresses
> periodic ticks.
> 
> Fixes: 29f59db3a74b ("sched: group-scheduler core")
> Signed-off-by: Zicheng Qu <quzicheng@...wei.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@....com>

I haven't been able to reproduce this but the change looks reasonable to
me, so:

Reviewed-by: Aaron Lu <ziqianlu@...edance.com>