Message-ID: <ee48e0f3-7b48-4f1c-a1d8-07779e35e949@amd.com>
Date: Wed, 21 Jan 2026 10:54:11 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>, Zicheng Qu <quzicheng@...wei.com>
CC: <bsegall@...gle.com>, <dhaval@...ux.vnet.ibm.com>,
<dietmar.eggemann@....com>, <juri.lelli@...hat.com>,
<linux-kernel@...r.kernel.org>, <mgorman@...e.de>, <mingo@...hat.com>,
<peterz@...radead.org>, <rostedt@...dmis.org>, <tanghui20@...wei.com>,
<vatsa@...ux.vnet.ibm.com>, <vincent.guittot@...aro.org>,
<vschneid@...hat.com>, <zhangqiao22@...wei.com>
Subject: Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks
out of throttled cgroups
Hello Aaron,
On 1/21/2026 9:19 AM, Aaron Lu wrote:
> On Tue, Jan 20, 2026 at 03:25:49AM +0000, Zicheng Qu wrote:
>> Consider the following sequence on a CPU configured with nohz_full:
>>
>> 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
>> bandwidth control. The group se of cgroup A, to which task P is
>> attached, is dequeued and the CPU switches to idle.
>>
>> 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
>> another cgroup B (not throttled).
>>
>> During sched_move_task(), the task P is observed as queued but not
>> running, and therefore no resched_curr() is triggered.
>>
>> 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
>> explicit scheduling event, i.e., resched_curr().
>>
>> 4) Later, cgroup A is unthrottled. However, task P has already been
>> migrated out of cgroup A, so unthrottle_cfs_rq() may observe
>> cfs_rq->load.weight == 0 and return early without calling resched_curr().
>
> I suppose this is only possible when the unthrottled cfs_rq has been
> fully decayed, i.e. !cfs_rq->on_list is true?
Ack! Since we detach the task from the cfs_rq during
task_change_group_fair(), the cfs_rq_is_decayed() check during
tg_unthrottle_up() can return true, so we skip putting the cfs_rq back on
the leaf_cfs_rq_list and unthrottle_cfs_rq() skips the resched.
> Because only in that case,
> it will skip the resched_curr() at the bottom of unthrottle_cfs_rq() for
> the scenario you have described.
Indeed. Happy coincidence that we check "rq->cfs.nr_queued" there, so an
unrelated unthrottle could still force a resched for a missed one :-)
>
> Looking at this logic, I feel the early return due to
> (!cfs_rq->load.weight) && (!cfs_rq->on_list) is strange, because the
> resched at the bottom:
>
> /* Determine whether we need to wake up potentially idle CPU: */
> if (rq->curr == rq->idle && rq->cfs.nr_queued)
> resched_curr(rq);
>
> should not depend on whether cfs_rq is fully decayed or not...
But if it is off the list, then it doesn't have any tasks to resched
anyway, and in Zicheng's scenario too the cfs_rq won't have any tasks at
the time of unthrottle.
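
For completeness, cfs_rq_is_decayed() can only return true once
load.weight is zero, so an off-list cfs_rq really has nothing runnable
left on it. Roughly (again from a recent fair.c, details may vary by
version):

  static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  {
          /* Any queued entity contributes weight, so this rules out runnable tasks */
          if (cfs_rq->load.weight)
                  return false;

          /* PELT sums not fully decayed yet, keep it on the list */
          if (!load_avg_is_decayed(&cfs_rq->avg))
                  return false;

          return true;
  }
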
>
> I think it should be something like this:
> - complete the branch if no task enqueued but still on_list;
> - only resched_curr() if task gets enqueued
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e71302282671c..e09da54a5d117 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6009,9 +6009,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> /* update hierarchical throttle state */
> walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
>
> - if (!cfs_rq->load.weight) {
> - if (!cfs_rq->on_list)
> - return;
> + if (!cfs_rq->load.weight && cfs_rq->on_list) {
> /*
> * Nothing to run but something to decay (on_list)?
> * Complete the branch.
> @@ -6025,7 +6023,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> assert_list_leaf_cfs_rq(rq);
>
> /* Determine whether we need to wake up potentially idle CPU: */
> - if (rq->curr == rq->idle && rq->cfs.nr_queued)
> + if (rq->curr == rq->idle && cfs_rq->nr_queued)
> resched_curr(rq);
> }
>
>
> Thoughts?
Yes, checking cfs_rq->nr_queued should indicate whether new tasks were
woken on this unthrottled hierarchy. It should also make it easier to
spot scenarios like the one Zicheng experienced, if there are any more of
those lurking around.
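
To spell the difference out for anyone skimming the hunk above (these two
lines are just the before/after of your diff, annotated):

  /* Before: wake the idle CPU if anything is queued on it, related or not */
  if (rq->curr == rq->idle && rq->cfs.nr_queued)
          resched_curr(rq);

  /* After: wake the idle CPU only if this unthrottle made something runnable here */
  if (rq->curr == rq->idle && cfs_rq->nr_queued)
          resched_curr(rq);
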
--
Thanks and Regards,
Prateek