Message-ID: <CAKfTPtAfBu9LkYn=VK4d9tyV+0V0CuafZq0E7kyjcZoTHknc7A@mail.gmail.com>
Date: Thu, 19 Dec 2024 17:22:32 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Christian Loehle <christian.loehle@....com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, vschneid@...hat.com, lukasz.luba@....com, 
	rafael.j.wysocki@...el.com, linux-kernel@...r.kernel.org, qyousef@...alina.io, 
	hongyan.xia2@....com, pierre.gondois@....com, qperret@...gle.com
Subject: Re: [PATCH 0/7 v2] sched/fair: Rework EAS to handle more cases

On Wed, 18 Dec 2024 at 15:06, Christian Loehle <christian.loehle@....com> wrote:
>
> Hi Vincent,
> just some quick remarks, I won't have time to actually review and test this
> in-depth until January. Sorry for that.

no problem

>
> On 12/17/24 16:07, Vincent Guittot wrote:
> > The current Energy Aware Scheduler has some known limitations which have
> > become more and more visible with features like uclamp, as an example. This
> > series tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >
> > Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> > whereas it is capped to a lower compute capacity. This wrong classification
> > can prevent the periodic load balancer from selecting a group_misfit_task
> > CPU because group_overloaded has higher priority.
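
To illustrate the classification problem (rough sketch only, not the
actual patch; the helper name and arguments are made up):

static inline bool capped_cpu_is_misfit(unsigned long util,
                                        unsigned long capacity_orig,
                                        unsigned long capacity_capped)
{
        /*
         * Utilization above the capped capacity but still below the
         * original capacity means the CPU is throttled, not
         * overloaded: classifying it as group_misfit_task lets the
         * periodic load balancer still pull the misfit task.
         */
        return util > capacity_capped && util <= capacity_orig;
}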
> >
> > Patch 2 creates a new EM interface that will be used by Patch 3
> >
> > Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD
> > whereas others might be a better choice. feec() looks for the CPU with the
> > highest spare capacity in a PD, assuming that it will be the best CPU from
> > an energy efficiency PoV because it will require the smallest increase of
> > OPP. This is often but not always true; as an example, this policy filters
> > out other CPUs which would be just as efficient because they use the same
> > OPP but have fewer running tasks.
> > In fact, we only care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and as a result having the same energy cost. In
> > such cases, we can use other metrics to select the best CPU with the same
> > energy cost. Patch 3 reworks feec() to look first for the lowest cost in a
> > PD and then for the most performant CPU among those with that cost. For
> > now, this only tries to evenly spread the number of runnable tasks across
> > CPUs, but this can be improved with other metrics like the sched slice
> > duration in a follow-up series.
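
To make the two-step policy concrete (illustrative sketch only, not the
patch itself; the struct and helper names are made up):

struct cpu_candidate {
        int cpu;
        unsigned long cost;             /* cost of the OPP the CPU would run at */
        unsigned int nr_running;        /* runnable tasks already on the CPU */
};

/* 1) prefer the lowest energy cost, 2) break ties with fewest runnable tasks */
static int pick_best_candidate(struct cpu_candidate *c, int nr)
{
        int i, best = 0;

        for (i = 1; i < nr; i++) {
                if (c[i].cost < c[best].cost ||
                    (c[i].cost == c[best].cost &&
                     c[i].nr_running < c[best].nr_running))
                        best = i;
        }

        return c[best].cpu;
}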
>
>
> Could you elaborate on why this is the better strategy instead of max_spare_cap?
> Presumably the CPU with the highest max_spare_cap has to have rather small
> tasks if it still has more runnable tasks than the other (higher-util) CPUs
> of the PD.

You don't always have a direct relation between nr_runnable,
max_spare_cap and task "size" because of blocked utilization. This
rework keeps the same behavior as highest max_spare_cap in a lot of
cases, including when the extra spare capacity makes a CPU select a
different OPP, but it also covers the other cases where blocked
utilization, uclamp_min, uclamp_max, or cpufreq clamping of the min/max
freq breaks this relation.
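
A made-up example of such a case (hypothetical numbers):

             capacity  util (blocked part)  spare cap  nr_running
  CPU0           1024            600 (500)        424           1
  CPU1           1024            400   (0)        624           3

Highest-spare-capacity picks CPU1 although most of CPU0's utilization
is blocked and it has fewer runnable tasks; if both CPUs would end up
at the same OPP the energy cost is identical, so the reworked feec()
can prefer CPU0.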

While studying trace, we can often see small tasks being packed on a
CPU whereas another one is idle in the same PD

> So nr of runnable tasks should intuitively be the less stable metric (to
> me anyway).

Spreading tasks helps to reduce the average scheduling latency, which
is beneficial for small tasks. This performance decision is a 1st
simple version which is meant to be improved with other hints like the
sched_slice.

>
> For which workloads does it make a difference?
> Which benefit from nr of runnable tasks? Which for max_spare_cap?

I have started to run some tests on an Android device but I don't have
consolidated results yet, and I didn't want to delay the v2 any
further.

>
> >
> > perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> > of the new feec() vs the current implementation.
> >
> > 9 iterations of perf bench sched pipe -T -l 80000
> >                 ops/sec  stdev
> > tip/sched/core  13001    (+/- 1.2%)
> > + patches 1-3   14349    (+/- 5.4%)  +10.4%
>
> I'm confused: the feec() rework in patch 3 does more comparisons overall,
> so it should be slower, but here we have a 10% improvement?

TBH, I didn't expect a perf improvement but wanted to check that there
is no regression. I ran the tests several times and the results are
always in the same range.

> OTOH feec() shouldn't be running much in the first place, since you
> don't run it when overutilized anymore (i.e. keep mainline behavior).

This should not make any difference here as the system is not
overutilized anyway

> The difference should be negligible then, and for me it basically is (rk3399
> and -l 5000 to get a roughly comparable test duration; results in seconds,
> lower is better; 10 iterations):
> tip/sched/core:
> 20.4573 +-0.0832
> vingu/rework-eas-v2-patches-1-to-3:
> 20.7054 +-0.0411
>
> >
> >
> > Patch 4 removes the now unused em_cpu_energy()
> >
> > Patch 5 solves another problem: a task can be stuck on a CPU forever
> > because it doesn't sleep anymore and as a result never wakes up and calls
> > feec(). Such a task can be detected by comparing util_avg or runnable_avg
> > with the compute capacity of the CPU. Once detected, we can call feec() to
> > check if there is a better CPU for the stuck task. The call can be done in
> > 2 places:
> > - When the task is put back in the runnable list after its running slice,
> >   with the balance callback mechanism, similarly to the rt/dl push callback.
> > - During the cfs tick when there is only 1 running task stuck on the CPU,
> >   in which case the balance callback can't be used.
> >
> > This push callback mechanism, with the new feec() algorithm, ensures that
> > tasks always get a chance to migrate to the most suitable CPU and don't
> > stay stuck on a CPU which is no longer the most suitable one. As examples:
> > - A task that woke on a big CPU and whose uclamp max prevents it from
> >   sleeping and waking up again can migrate to a smaller CPU once that is
> >   more power efficient.
> > - Tasks are spread across the CPUs of the PD when they target the same OPP.
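
A rough sketch of the detection described above (illustrative only, not
the actual patch; the helper name and arguments are made up):

static inline bool task_stuck_on_cpu(unsigned long task_util,
                                     unsigned long task_runnable,
                                     unsigned long cpu_capacity)
{
        /*
         * A task that no longer fits the CPU but never sleeps will not
         * go through the wakeup path again; flag it so feec() can be
         * re-run from the balance callback or from the cfs tick.
         */
        return task_util >= cpu_capacity || task_runnable >= cpu_capacity;
}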
> >
> > Patch 6 adds the task misfit migration case to the cfs tick and push
> > callback mechanism, to prevent waking up an idle CPU unnecessarily.
> >
> > Patch 7 removes the need to test uclamp_min in cpu_overutilized() to
> > trigger the active migration of a task to another CPU.
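
One way to picture the result (illustrative sketch only, not the actual
patch): without the uclamp_min term, the check is essentially a plain
utilization vs capacity comparison, something like:

static inline bool cpu_overutilized_sketch(unsigned long util,
                                           unsigned long capacity)
{
        /*
         * No uclamp_min term in the check anymore: tasks whose
         * uclamp_min is not honoured are actively migrated by the push
         * callback instead of raising the overutilized flag.
         */
        return util * 1280 >= capacity * 1024;  /* ~80% margin */
}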
>
> Would it make sense to further split 5-7 for ease of reviewing?
> Maybe even 1 and 4 as fixes, too?
>
> Regards,
> Christian
>
