linux-kernel - Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime


Message-ID: <c4a1bcea-fb00-6f3f-6bf6-d876393190e4@gmail.com>
Date: Tue, 14 Oct 2025 15:43:10 +0800
From: Hao Jia <jiahao.kernel@...il.com>
To: Aaron Lu <ziqianlu@...edance.com>,
 Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
 K Prateek Nayak <kprateek.nayak@....com>,
 Peter Zijlstra <peterz@...radead.org>,
 Chengming Zhou <chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>,
 Ingo Molnar <mingo@...hat.com>, Vincent Guittot
 <vincent.guittot@...aro.org>, Xi Wang <xii@...gle.com>
Cc: linux-kernel@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
 Chuyi Zhou <zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>,
 Florian Bezdeka <florian.bezdeka@...mens.com>,
 Songtang Liu <liusongtang@...edance.com>, Chen Yu <yu.c.chen@...el.com>,
 Matteo Martelli <matteo.martelli@...ethink.co.uk>,
 Michal Koutný <mkoutny@...e.com>,
 Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with
 zero runtime_remaining


Hello Aaron,

On 2025/9/29 15:46, Aaron Lu wrote:
> When a cfs_rq is to be throttled, its limbo list should be empty and
> that's why there is a warn in tg_throttle_down() for non empty
> cfs_rq->throttled_limbo_list.
> 
> When running a test with the following hierarchy:
> 
>            root
>          /      \
>          A*     ...
>       /  |  \   ...
>          B
>         /  \
>        C*
> 
> where both A and C have quota settings, that warn on non empty limbo list
> is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
> part of the cfs_rq for the sake of simpler representation).
> 

I encountered a similar warning a while ago and fixed it, and I have a
question. tg_unthrottle_up(cfs_rq_C) calls enqueue_task_fair(p) to
enqueue a task, which requires runtime_remaining to be greater than 0 at
every level of task p's task_group hierarchy.

In addition to the case you fixed above: when bandwidth control is
running normally, is there a corner case where
cfs_A->runtime_remaining > 0 but cfs_B->runtime_remaining < 0, which
could trigger a similar warning?
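
To make the question concrete, here is a toy userspace model of the
condition I have in mind. It is only a sketch, not kernel code: struct
toy_cfs_rq and toy_first_exhausted() are made-up names, and the model
ignores that the real check_enqueue_throttle() may first try to obtain
more runtime before throttling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a per-CPU cfs_rq; NOT the kernel's struct cfs_rq. */
struct toy_cfs_rq {
	const char *name;
	bool runtime_enabled;       /* this group has a quota set */
	int64_t runtime_remaining;  /* only meaningful if runtime_enabled */
	struct toy_cfs_rq *parent;  /* NULL at the root */
};

/*
 * Walk from @cfs_rq towards the root and return the first level that has
 * a quota but no runtime left. Return NULL if every level could accept
 * the task without being throttled.
 */
static struct toy_cfs_rq *toy_first_exhausted(struct toy_cfs_rq *cfs_rq)
{
	assert(cfs_rq != NULL);

	for (; cfs_rq; cfs_rq = cfs_rq->parent) {
		if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
			return cfs_rq;
	}
	return NULL;
}
```

With A having runtime left and B in deficit, walking up from C stops at
B, which is the level I suspect could be throttled during the enqueue.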

So I previously tried to fix this issue with the following code, which
adds an ENQUEUE_THROTTLE flag so that check_enqueue_throttle() is
skipped for tasks enqueued from tg_unthrottle_up().

---
  kernel/sched/fair.c  | 6 ++++--
  kernel/sched/sched.h | 1 +
  2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e..128efa2eba57 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5290,7 +5290,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
  	se->on_rq = 1;

  	if (cfs_rq->nr_queued == 1) {
-		check_enqueue_throttle(cfs_rq);
+		if (!(flags & ENQUEUE_THROTTLE))
+			check_enqueue_throttle(cfs_rq);
+
  		list_add_leaf_cfs_rq(cfs_rq);
  #ifdef CONFIG_CFS_BANDWIDTH
  		if (cfs_rq->pelt_clock_throttled) {
@@ -5905,7 +5907,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
  	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
  		list_del_init(&p->throttle_node);
  		p->throttled = false;
-		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
  	}

  	/* Add cfs_rq with load or one or more already running entities to the list */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b5367c514c14..871dfb761676 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2358,6 +2358,7 @@ extern const u32		sched_prio_to_wmult[40];
  #define ENQUEUE_MIGRATING	0x100
  #define ENQUEUE_DELAYED		0x200
  #define ENQUEUE_RQ_SELECTED	0x400
+#define ENQUEUE_THROTTLE	0x800

  #define RETRY_TASK		((void *)-1UL)
---

Unfortunately, I could not reproduce this corner case with the tests I
built locally.
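
For reference, the intended effect of the flag can be sketched in a
small userspace model. This is an illustration only: struct toy_rq,
toy_enqueue() and the TOY_* flags are hypothetical, and the throttle
decision is reduced to a bare runtime_remaining <= 0 test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ENQUEUE_WAKEUP	0x01
#define TOY_ENQUEUE_THROTTLE	0x800	/* same value as in the patch above */

/* Toy run queue; NOT the kernel's struct cfs_rq. */
struct toy_rq {
	unsigned int nr_queued;
	int64_t runtime_remaining;
	bool throttled;
};

/*
 * Model of the patched enqueue path: the throttle check runs only when
 * the first entity is queued, and TOY_ENQUEUE_THROTTLE suppresses it.
 */
static void toy_enqueue(struct toy_rq *rq, int flags)
{
	assert(rq != NULL);

	rq->nr_queued++;
	if (rq->nr_queued == 1 && !(flags & TOY_ENQUEUE_THROTTLE)) {
		if (rq->runtime_remaining <= 0)
			rq->throttled = true;
	}
}
```

In this model, an enqueue onto an empty queue with no runtime left
throttles the queue unless TOY_ENQUEUE_THROTTLE is passed, which is the
behaviour the patch aims for in tg_unthrottle_up().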

Thanks,
Hao