linux-kernel - Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime


Message-ID: <20251014115018.GC41@bytedance>
Date: Tue, 14 Oct 2025 19:50:18 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Hao Jia <jiahao.kernel@...il.com>
Cc: Valentin Schneider <vschneid@...hat.com>,
	Ben Segall <bsegall@...gle.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Peter Zijlstra <peterz@...radead.org>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Chuyi Zhou <zhouchuyi@...edance.com>,
	Jan Kiszka <jan.kiszka@...mens.com>,
	Florian Bezdeka <florian.bezdeka@...mens.com>,
	Songtang Liu <liusongtang@...edance.com>,
	Chen Yu <yu.c.chen@...el.com>,
	Matteo Martelli <matteo.martelli@...ethink.co.uk>,
	Michal Koutný <mkoutny@...e.com>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with
 zero runtime_remaining

On Tue, Oct 14, 2025 at 07:01:15PM +0800, Hao Jia wrote:
> 
> Hello Aaron,
> 
> Thank you for your reply.
> 
> On 2025/10/14 17:11, Aaron Lu wrote:
> > Hi Hao,
> > 
> > On Tue, Oct 14, 2025 at 03:43:10PM +0800, Hao Jia wrote:
> > > 
> > > Hello Aaron,
> > > 
> > > On 2025/9/29 15:46, Aaron Lu wrote:
> > > > When a cfs_rq is to be throttled, its limbo list should be empty and
> > > > that's why there is a warn in tg_throttle_down() for non empty
> > > > cfs_rq->throttled_limbo_list.
> > > > 
> > > > When running a test with the following hierarchy:
> > > > 
> > > >             root
> > > >           /      \
> > > >           A*     ...
> > > >        /  |  \   ...
> > > >           B
> > > >          /  \
> > > >         C*
> > > > 
> > > > where both A and C have quota settings, that warn on non empty limbo list
> > > > is triggered for a cfs_rq of C, let's call it cfs_rq_c (and ignore the cpu
> > > > part of the cfs_rq for the sake of simpler representation).
> > > > 
> > > 
> > > I encountered a similar warning a while ago and fixed it. I have a question
> > > I'd like to ask. tg_unthrottle_up(cfs_rq_C) calls enqueue_task_fair(p) to
> > > enqueue a task, which requires that the runtime_remaining of task p's entire
> > > task_group hierarchy be greater than 0.
> > > 
> > > In addition to the case you fixed above, when bandwidth control is running
> > > normally, is it possible that there's a corner case where
> > > cfs_A->runtime_remaining > 0 but cfs_B->runtime_remaining < 0, which
> > > could trigger a similar warning?
> > 
> > Do you mean B also has quota set and cfs_B's runtime_remaining < 0?
> > In this case, B should be throttled and C is a descendant of B so should
> > also be throttled, i.e. C can't be unthrottled when B is in throttled
> > state. Do I understand you correctly?
> > 
> Yes, both A and B have quota set.
> 
> Is there a possible corner case?
> Asynchronous unthrottling causes other running entities to completely
> consume cfs_B->runtime_remaining (cfs_B->runtime_remaining < 0) but not
> completely consume cfs_A->runtime_remaining (cfs_A->runtime_remaining > 0)
> when we call unthrottle_cfs_rq(cfs_rq_A).

Let me try to understand the situation here: in your described setup,
do all three task groups (A, B, C) have quota set?

> 
> When we unthrottle_cfs_rq(cfs_rq_A), cfs_A->runtime_remaining > 0, but if
> cfs_B->runtime_remaining < 0 at this time,

Hmm... if cfs_B->runtime_remaining < 0, why isn't it throttled?

> therefore, when enqueue_task_fair(p)->check_enqueue_throttle(cfs_rq_B)->throttle_cfs_rq(cfs_rq_B),

I assume p is a task of group B?
So when A is unthrottled, since p is a throttled task of group B and B
is still throttled, enqueue_task_fair(p) should not happen.
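
To make the reasoning above concrete, here is a small user-space sketch.
It is only a toy model, not kernel code: struct toy_cfs_rq,
toy_can_enqueue() and toy_unthrottle_demo() are hypothetical names made
up for illustration, and the single "throttled" flag is a simplifying
assumption. It only shows that clearing the throttled state of A does not
make a task of B enqueueable while B itself is still throttled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a per-CPU cfs_rq; NOT the kernel's struct cfs_rq. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;	/* NULL for the root */
	bool throttled;			/* this level is out of quota */
};

/*
 * In this model a task may be enqueued only if no level from its own
 * cfs_rq up to the root is throttled.
 */
static bool toy_can_enqueue(const struct toy_cfs_rq *cfs_rq)
{
	for (; cfs_rq; cfs_rq = cfs_rq->parent)
		if (cfs_rq->throttled)
			return false;
	return true;
}

/* Walk through the A/B case discussed above. */
static void toy_unthrottle_demo(void)
{
	struct toy_cfs_rq a = { .parent = NULL, .throttled = true };
	struct toy_cfs_rq b = { .parent = &a, .throttled = true };

	a.throttled = false;		/* A gets unthrottled ... */
	assert(!toy_can_enqueue(&b));	/* ... but B still blocks p */

	b.throttled = false;		/* only now can p be enqueued */
	assert(toy_can_enqueue(&b));
}
```

Calling toy_unthrottle_demo() from a test program should pass both
assertions under these assumptions.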

> a warning may be triggered.
> 
> My core question is:
> When we call unthrottle_cfs_rq(cfs_rq_A), we only check
> cfs_rq_A->runtime_remaining. However,
> enqueue_task_fair(p)->enqueue_entity(C->B->A)->check_enqueue_throttle() does

According to this info, I assume p is a task of group C here. If
unthrottling A would cause p to be enqueued, that means either groups C
and B do not have quota set, or groups C and B are in the unthrottled state.

> require that the runtime_remaining of each task_group level of task p is
> greater than 0.

If groups C and B are in the unthrottled state, their runtime_remaining
should be > 0.
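
Put as a check, the expectation is that no unthrottled level with quota
enabled has runtime_remaining <= 0. A rough user-space sketch of that
check follows. It is a toy model under stated assumptions, not kernel
code: struct toy_bw_rq, toy_find_violation() and toy_invariant_demo() are
hypothetical names, and only the field names runtime_enabled and
runtime_remaining are borrowed from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a cfs_rq with bandwidth fields; NOT the kernel struct. */
struct toy_bw_rq {
	struct toy_bw_rq *parent;	/* NULL for the root */
	bool runtime_enabled;		/* quota is set for this group */
	bool throttled;
	long long runtime_remaining;	/* signed, may go negative */
};

/*
 * Return the first level on the path to the root that is unthrottled,
 * has quota enabled, and has no runtime left; NULL if there is none.
 */
static const struct toy_bw_rq *toy_find_violation(const struct toy_bw_rq *rq)
{
	for (; rq; rq = rq->parent)
		if (!rq->throttled && rq->runtime_enabled &&
		    rq->runtime_remaining <= 0)
			return rq;
	return NULL;
}

/* The corner case asked about: A has runtime left, B does not. */
static void toy_invariant_demo(void)
{
	struct toy_bw_rq a = { NULL, true, false, 5 };
	struct toy_bw_rq b = { &a, true, false, -1 };
	struct toy_bw_rq c = { &b, false, false, 0 };

	/* B is unthrottled with runtime_remaining < 0: flagged. */
	assert(toy_find_violation(&c) == &b);

	/* Once B is throttled, nothing on the path is flagged. */
	b.throttled = true;
	assert(toy_find_violation(&c) == NULL);
}
```

In this model a group without quota (C here) is never flagged, whatever
its runtime_remaining is.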

> 
> Can we guarantee this?

To check this, a warning like the one below could be used. Can you try it
in your setup and see whether you can hit it? Thanks.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ef11783369d7..c347aa28c411a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5908,6 +5908,8 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		cfs_rq->throttled_clock_self_time += delta;
 	}
 
+	WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0);
+
 	/* Re-enqueue the tasks that have been throttled at this level. */
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);