linux-kernel - Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtDbAmYkJ9KKzeFa9yJM4psiN6=eYj6ZN7h0gE0zJGWscQ@mail.gmail.com>
Date: Sat, 30 Nov 2024 11:50:06 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Hongyan Xia <hongyan.xia2@....com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, vschneid@...hat.com, lukasz.luba@....com, 
	rafael.j.wysocki@...el.com, linux-kernel@...r.kernel.org, qyousef@...alina.io
Subject: Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases

Hi Hongyan,

On Thu, 28 Nov 2024 at 18:24, Hongyan Xia <hongyan.xia2@....com> wrote:
>
> Hi Vincent,
>
> On 30/08/2024 14:03, Vincent Guittot wrote:
> > The current Energy Aware Scheduler has some known limitations which have
> > became more and more visible with features like uclamp as an example. This
> > serie tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >
> > Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> > whereas it is capped to a lower compute capacity. This wrong classification
> > can prevent periodic load balancer to select a group_misfit_task CPU
> > because group_overloaded has higher priority.
> >
> >
> > Patch 2 creates a new EM interface that will be used by Patch 3
> >
> >
> > Patch 3 fixes the issue of tasks being stacked on same CPU of a PD whereas
> > others might be a better choice. feec() looks for the CPU with the highest
> > spare capacity in a PD assuming that it will be the best CPU from a energy
> > efficiency PoV because it will require the smallest increase of OPP.
> > This is often but not always true, this policy filters some others CPUs
> > which would be as efficients because of using the same OPP but with less
> > running tasks as an example.
> > In fact, we only care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and as a result having the same energy cost. In
> > such cases, we can use other metrics to select the best CPU with the same
> > energy cost. Patch 3 rework feec() to look 1st for the lowest cost in a PD
> > and then the most performant CPU between CPUs.
> >
> > perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> > of the new feec() vs current implementation.
> > sidenote: delayed dequeue has been disable for all tests.
> >
> > 9 iterations of perf bench sched pipe -T -l 80000
> >                  ops/sec  stdev
> > tip/sched/core  13490    (+/- 1.7%)
> > + patches 1-3   14095    (+/- 1.7%)  +4.5%
> >
> >
> > When overutilized, the scheduler stops looking for an energy efficient CPU
> > and fallback to the default performance mode. Although this is the best
> > choice when a system is fully overutilized, it also breaks the energy
> > efficiency when one CPU becomes overutilized for a short time because of
> > kworker and/or background activity as an example.
> > Patch 4 calls feec() everytime instead of skipping it when overutlized,
> > and fallback to default performance mode only when feec() can't find a
> > suitable CPU. The main advantage is that the task placement remains more
> > stable especially when there is a short and transient overutilized state.
> > The drawback is that the overhead can be significant for some CPU intensive
> > use cases.
> >
> > The overhead of patch 4 has been stressed with hackbench on dragonboard rb5
> >
> >                                 tip/sched/core        + patches 1-4
> >                              Time    stdev         Time    stdev
> > hackbench -l 5120 -g 1         0.724   +/-1.3%       0.765   +/-3.0% (-5.7%)
> > hackbench -l 1280 -g 4         0.740   +/-1.1%       0.768   +/-1.8% (-3.8%)
> > hackbench -l 640  -g 8         0.792   +/-1.3%       0.812   +/-1.6% (-2.6%)
> > hackbench -l 320  -g 16        0.847   +/-1.4%       0.852   +/-1.8% (-0.6%)
> >
> > hackbench -p -l 5120 -g 1      0.878   +/-1.9%       1.115   +/-3.0% (-27%)
> > hackbench -p -l 1280 -g 4      0.789   +/-2.6%       0.862   +/-5.0% (-9.2%)
> > hackbench -p -l 640  -g 8      0.732   +/-1.9%       0.801   +/-4.3% (-9.4%)
> > hackbench -p -l 320  -g 16     0.710   +/-4.7%       0.767   +/-4.9% (-8.1%)
> >
> > hackbench -T -l 5120 -g 1      0.756   +/-3.9%       0.772   +/-1.63 (-2.0%)
> > hackbench -T -l 1280 -g 4      0.725   +/-1.4%       0.737   +/-2.0% (-1.3%)
> > hackbench -T -l 640  -g 8      0.767   +/-1.5%       0.809   +/-2.6% (-5.5%)
> > hackbench -T -l 320  -g 16     0.812   +/-1.2%       0.823   +/-2.2% (-1.4%)
> >
> > hackbench -T -p -l 5120 -g 1   0.941   +/-2.5%       1.190   +/-1.6% (-26%)
> > hackbench -T -p -l 1280 -g 4   0.869   +/-2.5%       0.931   +/-4.9% (-7.2%)
> > hackbench -T -p -l 640  -g 8   0.819   +/-2.4%       0.895   +/-4.6% (-9.3%)
> > hackbench -T -p -l 320  -g 16  0.763   +/-2.6%       0.863   +/-5.0% (-13%)
> >
> > Side note: Both new feec() and current feec() give similar overheads with
> > patch 4.
> >
> > Although the highest reachable CPU throughput is not the only target of EAS,
> > the overhead can be significant in some cases as shown in hackbech results
> > above. That being said I still think it's worth the benefit for the stability
> > of tasks placement and a better control of the power.
> >
> >
> > Patch 5 solves another problem with tasks being stuck on a CPU forever
> > because it doesn't sleep anymore and as a result never wakeup and call
> > feec(). Such task can be detected by comparing util_avg or runnable_avg
> > with the compute capacity of the CPU. Once detected, we can call feec() to
> > check if there is a better CPU for the stuck task. The call can be done in
> > 2 places:
> > - When the task is put back in the runnnable list after its running slice
> >    with the balance callback mecanism similarly to the rt/dl push callback.
> > - During cfs tick when there is only 1 running task stuck on the CPU in
> >    which case the balance callback can't be used.
> >
> > This push callback doesn't replace the current misfit task mecanism which
> > is already implemented but this could be considered as a follow up serie.
> >
> >
> > This push callback mecanism with the new feec() algorithm ensures that
> > tasks always get a chance to migrate on the best suitable CPU and don't
> > stay stuck on a CPU which is no more the most suitable one. As examples:
> > - A task waking on a big CPU with a uclamp max preventing it to sleep and
> >    wake up, can migrate on a smaller CPU once it's more power efficient.
> > - The tasks are spread on CPUs in the PD when they target the same OPP.
> >
> > This series implements some of the topics discussed at OSPM [1]. Other
> > topics will be part of an other serie
> >
> > [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
> >
> > Vincent Guittot (5):
> >    sched/fair: Filter false overloaded_group case for EAS
> >    energy model: Add a get previous state function
> >    sched/fair: Rework feec() to use cost instead of spare capacity
> >    sched/fair: Use EAS also when overutilized
> >    sched/fair: Add push task callback for EAS
> >
> >   include/linux/energy_model.h |  18 +
> >   kernel/sched/fair.c          | 693 +++++++++++++++++++++++------------
> >   kernel/sched/sched.h         |   2 +
> >   3 files changed, 488 insertions(+), 225 deletions(-)
> >
>
> On second look, I do wonder if this series should be split into
> individual patches or mini-series. Some of the ideas, like
> overloaded_groups or calling EAS at more locations rather than just
> wake-up events, might be easier to review and merge if they are independent.

The series is almost ready, I was waiting for the support of v6.12 on
a device like pixel 6 to run some benchmarks but it is not yet
available publicly at least so I might send the serie without such
figures. I also wanted to test it with delayed dequeued enabled this
time unlike previous version:
https://lore.kernel.org/lkml/20241129161756.3081386-1-vincent.guittot@linaro.org/