Message-ID: <20260113131134.n4ixed2awnikgmeq@airbuntu>
Date: Tue, 13 Jan 2026 13:11:34 +0000
From: Qais Yousef <qyousef@...alina.io>
To: Christian Loehle <christian.loehle@....com>
Cc: linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rafael@...nel.org, peterz@...radead.org, pierre.gondois@....com,
qperret@...gle.com, sven@...npeter.dev
Subject: Re: [PATCH 0/1] sched: Ignore overutilized by lone task on max-cap CPU
On 12/30/25 09:30, Christian Loehle wrote:
> I'm trying to deliver on my overdue promise of redefining overutilized state.
> My investigation basically led to the conclusion that redefining overutilized
> state brings very little hard improvement, while carrying at least some risk
> of regressing platform and workload combinations I might've overlooked.
> Therefore I only concentrate on one change, the least controversial, for now.

What are the controversial bits?

This is a step forward, but I am not sure it is in the right direction. The
concept of a *cpu* being overutilized === rd is overutilized no longer makes
sense since misfit was decoupled from this logic, which was the sole reason to
require the check at the CPU level. The overutilized state is, rightly, set at
the root domain level, and the check makes sense to be done at that level too:
traverse the perf domains and see if we are in a state that requires moving
tasks around. This should happen in the update_{sg,sd}_lb_stats() logic only.
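
To make that concrete, something along these lines is what I have in mind.
This is a rough sketch only: the helper name is made up, and the caller is
responsible for RCU protection of the perf domain list.

/*
 * Hypothetical helper: the root domain counts as overutilized only if
 * every CPU in every perf domain has crossed the overutilized threshold.
 * Must be called with rcu_read_lock() held for rd->pd.
 */
static inline bool rd_all_cpus_overutilized(struct root_domain *rd)
{
	struct perf_domain *pd;
	int cpu;

	for (pd = rcu_dereference(rd->pd); pd; pd = pd->next) {
		for_each_cpu(cpu, perf_domain_span(pd)) {
			if (!cpu_overutilized(cpu))
				return false;
		}
	}

	return true;
}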

I guess the difficult question (which might be what you're referring to as
controversial) is: at what point can we no longer pack (use EAS) and must
distribute tasks around?

I think the answer is limited by what the lb can do today. With push lb, I
believe the current global lb is likely to be unnecessary on small systems
(single LLC), since push lb can shuffle things around immediately to handle
misfit and overload.

On top of that, what can the existing global lb do? I am not sure, to be
honest. The system has to have more long running tasks than CPUs for it to be
useful. But given that the util signal will lose its meaning under these
circumstances, I am not sure the global lb can do a better job than push lb at
moving these tasks around. But could it do a more comprehensive job in one go?
I'll defer to Vincent; he is probably better able to answer this off the top
of his head. The answer to this question is the key to how we want to define
the *system is overutilized* state.

Assuming this is on top of push lb, I believe something like the below, which
triggers overutilized only if all CPUs are overutilized (i.e. the system is
nearly maxed out, with 20% or less headroom left), is a good starting point at
least.
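
As a side note, the 20% figure is the fits_capacity() margin that
cpu_overutilized() ultimately checks against. Quoting mainline fair.c from
memory, so the exact form may differ on your base:

/* util must stay below ~80% of capacity to "fit", i.e. ~20% headroom */
#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
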
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..ba08f4aefa03 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6814,17 +6814,6 @@ static inline void set_rd_overutilized(struct root_domain *rd, bool flag)
 	trace_sched_overutilized_tp(rd, flag);
 }
 
-static inline void check_update_overutilized_status(struct rq *rq)
-{
-	/*
-	 * overutilized field is used for load balancing decisions only
-	 * if energy aware scheduler is being used
-	 */
-
-	if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
-		set_rd_overutilized(rq->rd, 1);
-}
-
 /* Runqueue only has SCHED_IDLE tasks enqueued */
 static int sched_idle_rq(struct rq *rq)
 {
@@ -6968,23 +6957,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	/* At this point se is NULL and we are at root level*/
 	add_nr_running(rq, 1);
 
-	/*
-	 * Since new tasks are assigned an initial util_avg equal to
-	 * half of the spare capacity of their CPU, tiny tasks have the
-	 * ability to cross the overutilized threshold, which will
-	 * result in the load balancer ruining all the task placement
-	 * done by EAS. As a way to mitigate that effect, do not account
-	 * for the first enqueue operation of new tasks during the
-	 * overutilized flag detection.
-	 *
-	 * A better way of solving this problem would be to wait for
-	 * the PELT signals of tasks to converge before taking them
-	 * into account, but that is not straightforward to implement,
-	 * and the following generally works well enough in practice.
-	 */
-	if (!task_new)
-		check_update_overutilized_status(rq);
-
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
@@ -10430,8 +10402,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		nr_running = rq->nr_running;
 		sgs->sum_nr_running += nr_running;
 
-		if (cpu_overutilized(i))
-			*sg_overutilized = 1;
+		*sg_overutilized &= cpu_overutilized(i);
 
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
@@ -11087,7 +11058,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	unsigned long sum_util = 0;
-	bool sg_overloaded = 0, sg_overutilized = 0;
+	bool sg_overloaded = 0, sg_overutilized = 1;
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
@@ -13378,7 +13349,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		task_tick_numa(rq, curr);
 
 	update_misfit_status(curr, rq);
-	check_update_overutilized_status(task_rq(curr));
 
 	task_tick_core(rq, curr);
 }
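
For completeness: with sg_overutilized seeded to 1 and ANDed per CPU, a single
CPU with headroom now keeps the flag clear. The accumulated value is consumed
at the bottom of update_sd_lb_stats(); quoting current mainline from memory,
so the exact base may differ:

	if (!env->sd->parent) {
		/* update overload indicator if we are at root domain */
		set_rd_overloaded(env->dst_rq->rd, sg_overloaded);

		/* Update over-utilization (tipping point, U >= 0) indicator */
		set_rd_overutilized(env->dst_rq->rd, sg_overutilized);
	} else if (sg_overutilized) {
		set_rd_overutilized(env->dst_rq->rd, sg_overutilized);
	}

The else-if leg would need a second look with the inverted logic: an AND
accumulated over a partial domain span no longer tells us anything about the
root domain as a whole.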