Message-ID: <20260206172341.4p5rv7o6dxv4l3la@airbuntu>
Date: Fri, 6 Feb 2026 17:23:41 +0000
From: Qais Yousef <qyousef@...alina.io>
To: Christian Loehle <christian.loehle@....com>
Cc: linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rafael@...nel.org, peterz@...radead.org, pierre.gondois@....com,
qperret@...gle.com, sven@...npeter.dev
Subject: Re: [PATCH 0/1] sched: Ignore overutilized by lone task on max-cap
CPU
On 01/15/26 11:17, Christian Loehle wrote:
> On 1/13/26 13:11, Qais Yousef wrote:
> > On 12/30/25 09:30, Christian Loehle wrote:
> >> I'm trying to deliver on my overdue promise of redefining overutilized state.
> >> My investigation basically lead to redefinition of overutilized state
> >> bringing very little hard improvements, while it comes with at least
> >> some risk of worsening platforms and workload combinations I might've
> >> overlooked, therefore I only concentrate on one, the least
> >> controversial, for now.
> >
> > What are the controversial bits?
> >
> > This is a step forward, but not sure it is in the right direction. The concept
> > of a *cpu* being overutilized === rd is overutilized no longer makes sense
> > since misfit was decoupled from this logic which was the sole reason to
> > require this check at CPU level. Overutilized state is, rightly, set at the
> > rootdomain level. And the check makes sense to be done at that level too by
> > traversing the perf domains and seeing if we are in a state that requires
> > moving tasks around. Which should be done in update_{sg,sd}_lb_stats() logic
> > only.
> >
> > I guess the difficult question (which might be what you're referring to as
> > controversial), is at what point we can no longer pack (use EAS) and must
> > distribute tasks around?
>
> And that is precisely the 'controversial bits', I didn't want to touch them
> with this patch specifically.

What makes it controversial? I don't think it really is. Maybe you're
referring to some offline discussion, or I missed something on the list.

The concept of a *cpu* being overutilized doesn't make sense for the purpose of
this overutilized state. Trying to make it better in some particular case is
not really moving us in the right direction. And not discussing what the
right direction is doesn't move us in any direction :)

> A more holistic redefinition of OU is still on the table, but it needs to
> a) Still fulfill the requirements we want from it (guarantee of accurate PELT
> values because compute capacity was 'always' provided, switching to throughput
> maximization when needed).

I think PELT is very inaccurate :-) Did you see my talk about invariance and
the black hole effect?

If you view this as a PELT accuracy problem, that is itself a problem. The code
was tightly coupled to the misfit logic, from which it has since been decoupled,
and these cpu_overutilized() checks are overzealous and can be safely removed
from many locations to start with. We need to focus on the concept of the
*system* being overutilized. Even if you keep the current logic as-is, just move
the checks to the right place: when deciding to do load balance, where we take
the global view of the system's state. Not on context switch etc, which I think
were there to help misfit trigger?
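
To make the system-level idea concrete, here is a minimal sketch of one
possible check. It assumes "system overutilized" means every CPU has run out
of headroom, using a margin of roughly 20%. The struct and function names are
made up for illustration and are not existing kernel code; a real version
would sit in the load balance stats path and walk the perf domains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU snapshot; not a kernel structure. */
struct cpu_snapshot {
	unsigned long util;     /* estimated utilization of the CPU */
	unsigned long capacity; /* capacity available on the CPU */
};

/* Assumed margin: util fits only if util * 1.25 < capacity (~20% headroom). */
static bool util_fits(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

/*
 * System-level view: report overutilized only when no CPU has any
 * headroom left. A single busy CPU is not enough to flip the state.
 */
static bool system_overutilized(const struct cpu_snapshot *cpus, size_t nr_cpus)
{
	assert(nr_cpus > 0);

	for (size_t i = 0; i < nr_cpus; i++) {
		if (util_fits(cpus[i].util, cpus[i].capacity))
			return false;
	}
	return true;
}
```

With this shape, one saturated CPU next to a mostly idle one does not count
as overutilized; only the case where all CPUs are near their limit does.
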
> b) Provide sufficient testing to convince us of not regressing anything majorly
> on the quite diverse EAS platforms we have today.

I don't think the testing effort is that hard really. Things that need multi
core performance should give us indications. GB MT is one of them, but you can
try speedometer with code compilation (limited to fewer cores than NCPUS) in
the background, for instance, to see how much this affects the score.

I'd agree it is hard if we don't know under what conditions it is supposed to
help. Which is my main point here. It is supposed to be useful under specific
scenarios only. And these scenarios are NOT tied to cpu state, but to global
system state. And I think we can reason about them.

When is packing on a PD worse than distributing? It is definitely not when
a CPU is saturated. feec() has improved a lot over the years and distributes
load a lot better than in its earlier days. The question is when does it fail?

I think under a few scenarios:

1. Number of tasks >> number of cpus
2. Many of these tasks are long running and won't sleep and wake up again for
   feec()/wake up to distribute them again.

It will then need to help move those long running tasks to idle cpus as they
become available. But if tasks are sleeping and waking up then they'd be
distributed without any additional help. If not, the fix is to make the wake up
path smarter.
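
As a rough illustration of the two conditions above, a sketch of a predicate
that says when the wake up path alone can't redistribute tasks. Everything
here is hypothetical: the struct, the function names and the threshold are
assumptions for the sake of the example, not existing kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task snapshot; not a kernel structure. */
struct task_snapshot {
	uint64_t runnable_ns; /* time spent runnable since the last wakeup */
};

/* Count tasks that have stayed runnable for longer than threshold_ns. */
static size_t count_long_running(const struct task_snapshot *tasks,
				 size_t nr_tasks, uint64_t threshold_ns)
{
	size_t n = 0;

	for (size_t i = 0; i < nr_tasks; i++) {
		if (tasks[i].runnable_ns > threshold_ns)
			n++;
	}
	return n;
}

/*
 * The wake up path gets no chance to place tasks that never sleep, so
 * lb is needed once more long running tasks exist than there are CPUs.
 */
static bool wakeup_path_insufficient(const struct task_snapshot *tasks,
				     size_t nr_tasks, size_t nr_cpus,
				     uint64_t threshold_ns)
{
	assert(nr_cpus > 0);

	return count_long_running(tasks, nr_tasks, threshold_ns) > nr_cpus;
}
```

The threshold is the part that would need real data; the point is only that
the condition is a property of the whole system, not of one CPU.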

It can also help when there is no idle time but many tasks keep waking up.
Some tasks can get stuck enqueued for a long time where we can have nr_running
high on one cpu, but I'd argue we have issues with wake up path packing when
the system is loaded. Even feec() shouldn't do that. Still, lb is useful because
an enqueued task can't go to sleep, even if it only needs to run for a short
time, if it is not given a chance to run.

Do you have other scenarios in mind? I think breaking the problem down based on
benefits would help advance the code and clarify what tests are required to
show that it behaves correctly. You seem to imply we can't know where it is
supposed to help well enough to test sufficiently that it is not a problem, and
this is where I disagree. We should be able to quantify and demonstrate where
it should help.

>
> I think $SUBJECT does a) and b) well, but of course it's for improving a
> specific set of systems and doesn't address the issues with OU that have been
> named in the past.
>
> >
> > I think this question is limited by what the lb can do today. With push lb,
> > I believe the current global lb is likely to be unnecessary in small systems
> > (single LLC) since it can shuffle things around immediately to handle misfit
> > and overload.
> >
> > On top of that, what can the existing global lb do? I am not sure to be honest.
> > The system has to have a number of long running tasks > num_cpus for it to be
> > useful. But given util signal will lose its meaning under these circumstances,
> > I am not sure the global lb can do a better job than push lb trying to move
> > these tasks around. But it could do a more comprehensive job in one go? I'll
> > defer to Vincent, he probably more able to answer this from the top of his
> > head. But the answer to this question is the key to how we want to define this
> > *system* is overutilized state.
> >
> > Assuming this is on top of push lb, I believe something like below which will
> > trigger overutilized only if all cpus are overutilized (ie system is nearly
> > maxed out (has 20% or less headroom)) is a good starting point at least.
>
> It's an approach, but it needs a lot of data to convince everyone that
> push lb + much more liberal OU state outperforms current global LB OU.
>
> Given this is not really about defining OU in a final state, any comments from
> you and Vincent on $SUBJECT and the problem it's addressing would be
> much appreciated!

I think you're avoiding the problem. And the testing effort is not really that
different in both cases IMO.

In my view, our load balancer is generally not great and very slow to react.
I do believe the push lb will make this overutilized state completely
unnecessary. But we shall see :)

I am not a fan of this band aid. As I said, it makes things better but is not
moving us in the right direction. I'd rather see discussion of the latter.
Burying it with "we'll do it later" and "it's controversial" is what concerns
me the most and makes me not keen on taking this small improvement.

But if Peter or Vincent see it as helpful, no real objection from me FWIW.
I just think it's not hard to do better.

Cheers

--
Qais Yousef