[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtDbAmYkJ9KKzeFa9yJM4psiN6=eYj6ZN7h0gE0zJGWscQ@mail.gmail.com>
Date: Sat, 30 Nov 2024 11:50:06 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Hongyan Xia <hongyan.xia2@....com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, lukasz.luba@....com,
rafael.j.wysocki@...el.com, linux-kernel@...r.kernel.org, qyousef@...alina.io
Subject: Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
Hi Hongyan,
On Thu, 28 Nov 2024 at 18:24, Hongyan Xia <hongyan.xia2@....com> wrote:
>
> Hi Vincent,
>
> On 30/08/2024 14:03, Vincent Guittot wrote:
> > The current Energy Aware Scheduler has some known limitations which have
> > became more and more visible with features like uclamp as an example. This
> > serie tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >
> > Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> > whereas it is capped to a lower compute capacity. This wrong classification
> > can prevent periodic load balancer to select a group_misfit_task CPU
> > because group_overloaded has higher priority.
> >
> >
> > Patch 2 creates a new EM interface that will be used by Patch 3
> >
> >
> > Patch 3 fixes the issue of tasks being stacked on same CPU of a PD whereas
> > others might be a better choice. feec() looks for the CPU with the highest
> > spare capacity in a PD assuming that it will be the best CPU from a energy
> > efficiency PoV because it will require the smallest increase of OPP.
> > This is often but not always true, this policy filters some others CPUs
> > which would be as efficients because of using the same OPP but with less
> > running tasks as an example.
> > In fact, we only care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and as a result having the same energy cost. In
> > such cases, we can use other metrics to select the best CPU with the same
> > energy cost. Patch 3 rework feec() to look 1st for the lowest cost in a PD
> > and then the most performant CPU between CPUs.
> >
> > perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> > of the new feec() vs current implementation.
> > sidenote: delayed dequeue has been disable for all tests.
> >
> > 9 iterations of perf bench sched pipe -T -l 80000
> > ops/sec stdev
> > tip/sched/core 13490 (+/- 1.7%)
> > + patches 1-3 14095 (+/- 1.7%) +4.5%
> >
> >
> > When overutilized, the scheduler stops looking for an energy efficient CPU
> > and fallback to the default performance mode. Although this is the best
> > choice when a system is fully overutilized, it also breaks the energy
> > efficiency when one CPU becomes overutilized for a short time because of
> > kworker and/or background activity as an example.
> > Patch 4 calls feec() everytime instead of skipping it when overutlized,
> > and fallback to default performance mode only when feec() can't find a
> > suitable CPU. The main advantage is that the task placement remains more
> > stable especially when there is a short and transient overutilized state.
> > The drawback is that the overhead can be significant for some CPU intensive
> > use cases.
> >
> > The overhead of patch 4 has been stressed with hackbench on dragonboard rb5
> >
> > tip/sched/core + patches 1-4
> > Time stdev Time stdev
> > hackbench -l 5120 -g 1 0.724 +/-1.3% 0.765 +/-3.0% (-5.7%)
> > hackbench -l 1280 -g 4 0.740 +/-1.1% 0.768 +/-1.8% (-3.8%)
> > hackbench -l 640 -g 8 0.792 +/-1.3% 0.812 +/-1.6% (-2.6%)
> > hackbench -l 320 -g 16 0.847 +/-1.4% 0.852 +/-1.8% (-0.6%)
> >
> > hackbench -p -l 5120 -g 1 0.878 +/-1.9% 1.115 +/-3.0% (-27%)
> > hackbench -p -l 1280 -g 4 0.789 +/-2.6% 0.862 +/-5.0% (-9.2%)
> > hackbench -p -l 640 -g 8 0.732 +/-1.9% 0.801 +/-4.3% (-9.4%)
> > hackbench -p -l 320 -g 16 0.710 +/-4.7% 0.767 +/-4.9% (-8.1%)
> >
> > hackbench -T -l 5120 -g 1 0.756 +/-3.9% 0.772 +/-1.63 (-2.0%)
> > hackbench -T -l 1280 -g 4 0.725 +/-1.4% 0.737 +/-2.0% (-1.3%)
> > hackbench -T -l 640 -g 8 0.767 +/-1.5% 0.809 +/-2.6% (-5.5%)
> > hackbench -T -l 320 -g 16 0.812 +/-1.2% 0.823 +/-2.2% (-1.4%)
> >
> > hackbench -T -p -l 5120 -g 1 0.941 +/-2.5% 1.190 +/-1.6% (-26%)
> > hackbench -T -p -l 1280 -g 4 0.869 +/-2.5% 0.931 +/-4.9% (-7.2%)
> > hackbench -T -p -l 640 -g 8 0.819 +/-2.4% 0.895 +/-4.6% (-9.3%)
> > hackbench -T -p -l 320 -g 16 0.763 +/-2.6% 0.863 +/-5.0% (-13%)
> >
> > Side note: Both new feec() and current feec() give similar overheads with
> > patch 4.
> >
> > Although the highest reachable CPU throughput is not the only target of EAS,
> > the overhead can be significant in some cases as shown in hackbech results
> > above. That being said I still think it's worth the benefit for the stability
> > of tasks placement and a better control of the power.
> >
> >
> > Patch 5 solves another problem with tasks being stuck on a CPU forever
> > because it doesn't sleep anymore and as a result never wakeup and call
> > feec(). Such task can be detected by comparing util_avg or runnable_avg
> > with the compute capacity of the CPU. Once detected, we can call feec() to
> > check if there is a better CPU for the stuck task. The call can be done in
> > 2 places:
> > - When the task is put back in the runnnable list after its running slice
> > with the balance callback mecanism similarly to the rt/dl push callback.
> > - During cfs tick when there is only 1 running task stuck on the CPU in
> > which case the balance callback can't be used.
> >
> > This push callback doesn't replace the current misfit task mecanism which
> > is already implemented but this could be considered as a follow up serie.
> >
> >
> > This push callback mecanism with the new feec() algorithm ensures that
> > tasks always get a chance to migrate on the best suitable CPU and don't
> > stay stuck on a CPU which is no more the most suitable one. As examples:
> > - A task waking on a big CPU with a uclamp max preventing it to sleep and
> > wake up, can migrate on a smaller CPU once it's more power efficient.
> > - The tasks are spread on CPUs in the PD when they target the same OPP.
> >
> > This series implements some of the topics discussed at OSPM [1]. Other
> > topics will be part of an other serie
> >
> > [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
> >
> > Vincent Guittot (5):
> > sched/fair: Filter false overloaded_group case for EAS
> > energy model: Add a get previous state function
> > sched/fair: Rework feec() to use cost instead of spare capacity
> > sched/fair: Use EAS also when overutilized
> > sched/fair: Add push task callback for EAS
> >
> > include/linux/energy_model.h | 18 +
> > kernel/sched/fair.c | 693 +++++++++++++++++++++++------------
> > kernel/sched/sched.h | 2 +
> > 3 files changed, 488 insertions(+), 225 deletions(-)
> >
>
> On second look, I do wonder if this series should be split into
> individual patches or mini-series. Some of the ideas, like
> overloaded_groups or calling EAS at more locations rather than just
> wake-up events, might be easier to review and merge if they are independent.
The series is almost ready, I was waiting for the support of v6.12 on
a device like pixel 6 to run some benchmarks but it is not yet
available publicly at least so I might send the serie without such
figures. I also wanted to test it with delayed dequeued enabled this
time unlike previous version:
https://lore.kernel.org/lkml/20241129161756.3081386-1-vincent.guittot@linaro.org/
Powered by blists - more mailing lists