Message-ID: <d690510c-c3c0-4551-bf18-e1b62269c8cc@arm.com>
Date: Thu, 28 Nov 2024 17:24:06 +0000
From: Hongyan Xia <hongyan.xia2@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>, mingo@...hat.com,
peterz@...radead.org, juri.lelli@...hat.com, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, lukasz.luba@....com, rafael.j.wysocki@...el.com,
linux-kernel@...r.kernel.org
Cc: qyousef@...alina.io
Subject: Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
Hi Vincent,
On 30/08/2024 14:03, Vincent Guittot wrote:
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features like uclamp. This series
> tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> when it is actually capped to a lower compute capacity. This wrong
> classification can prevent the periodic load balancer from selecting a
> group_misfit_task CPU because group_overloaded has higher priority.
>
>
> Patch 2 creates a new EM interface that will be used by Patch 3
>
>
> Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD when
> other CPUs might be a better choice. feec() looks for the CPU with the highest
> spare capacity in a PD, assuming that it will be the best CPU from an energy
> efficiency PoV because it will require the smallest increase of OPP.
> This is often but not always true: this policy filters out other CPUs
> that would be just as efficient because they use the same OPP, but with
> fewer running tasks, for example.
> In fact, we only care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result have the same energy cost. In
> such cases, we can use other metrics to select the best CPU with the same
> energy cost. Patch 3 reworks feec() to look first for the lowest cost in a PD
> and then for the most performant CPU among those CPUs.
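
To make the two-step selection described for patch 3 concrete, here is a minimal user-space sketch. It is not the code from the series: struct cpu_candidate, pick_cpu() and the use of nr_running as the tie-breaker are all assumptions for illustration, and the real feec() may use different or additional metrics.

```c
#include <stddef.h>

/* Hypothetical per-CPU candidate; none of these names come from the series. */
struct cpu_candidate {
	int cpu;                 /* CPU id */
	unsigned long cost;      /* cost of the OPP needed if the task lands here */
	unsigned int nr_running; /* tasks already runnable on this CPU */
};

/*
 * Pick the CPU with the lowest cost. Among CPUs with the same cost,
 * prefer the one with fewer running tasks (assumed tie-breaker).
 * Returns the CPU id, or -1 if there is no candidate.
 */
static int pick_cpu(const struct cpu_candidate *c, size_t n)
{
	const struct cpu_candidate *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!best ||
		    c[i].cost < best->cost ||
		    (c[i].cost == best->cost &&
		     c[i].nr_running < best->nr_running))
			best = &c[i];
	}
	return best ? best->cpu : -1;
}
```

With candidates {cpu 0, cost 120, 3 tasks}, {cpu 1, cost 120, 1 task} and {cpu 2, cost 150, 0 tasks}, this picks CPU 1: CPU 2 is idle but needs a more expensive OPP, and CPU 0 has the same cost as CPU 1 but more tasks stacked on it.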
>
> perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> of the new feec() vs current implementation.
> sidenote: delayed dequeue has been disabled for all tests.
>
> 9 iterations of perf bench sched pipe -T -l 80000
>                 ops/sec  stdev
> tip/sched/core  13490    (+/- 1.7%)
> + patches 1-3   14095    (+/- 1.7%)  +4.5%
>
>
> When overutilized, the scheduler stops looking for an energy efficient CPU
> and falls back to the default performance mode. Although this is the best
> choice when a system is fully overutilized, it also breaks the energy
> efficiency when one CPU becomes overutilized for a short time because of
> kworker and/or background activity, for example.
> Patch 4 calls feec() every time instead of skipping it when overutilized,
> and falls back to the default performance mode only when feec() can't find a
> suitable CPU. The main advantage is that the task placement remains more
> stable, especially when there is a short and transient overutilized state.
> The drawback is that the overhead can be significant for some CPU intensive
> use cases.
>
> The overhead of patch 4 has been stressed with hackbench on dragonboard rb5
>
>                               tip/sched/core   + patches 1-4
>                               Time   stdev     Time   stdev
> hackbench -l 5120 -g 1        0.724  +/-1.3%   0.765  +/-3.0%  (-5.7%)
> hackbench -l 1280 -g 4        0.740  +/-1.1%   0.768  +/-1.8%  (-3.8%)
> hackbench -l 640 -g 8         0.792  +/-1.3%   0.812  +/-1.6%  (-2.6%)
> hackbench -l 320 -g 16        0.847  +/-1.4%   0.852  +/-1.8%  (-0.6%)
>
> hackbench -p -l 5120 -g 1     0.878  +/-1.9%   1.115  +/-3.0%  (-27%)
> hackbench -p -l 1280 -g 4     0.789  +/-2.6%   0.862  +/-5.0%  (-9.2%)
> hackbench -p -l 640 -g 8      0.732  +/-1.9%   0.801  +/-4.3%  (-9.4%)
> hackbench -p -l 320 -g 16     0.710  +/-4.7%   0.767  +/-4.9%  (-8.1%)
>
> hackbench -T -l 5120 -g 1     0.756  +/-3.9%   0.772  +/-1.63  (-2.0%)
> hackbench -T -l 1280 -g 4     0.725  +/-1.4%   0.737  +/-2.0%  (-1.3%)
> hackbench -T -l 640 -g 8      0.767  +/-1.5%   0.809  +/-2.6%  (-5.5%)
> hackbench -T -l 320 -g 16     0.812  +/-1.2%   0.823  +/-2.2%  (-1.4%)
>
> hackbench -T -p -l 5120 -g 1  0.941  +/-2.5%   1.190  +/-1.6%  (-26%)
> hackbench -T -p -l 1280 -g 4  0.869  +/-2.5%   0.931  +/-4.9%  (-7.2%)
> hackbench -T -p -l 640 -g 8   0.819  +/-2.4%   0.895  +/-4.6%  (-9.3%)
> hackbench -T -p -l 320 -g 16  0.763  +/-2.6%   0.863  +/-5.0%  (-13%)
>
> Side note: Both new feec() and current feec() give similar overheads with
> patch 4.
>
> Although the highest reachable CPU throughput is not the only target of EAS,
> the overhead can be significant in some cases, as shown in the hackbench
> results above. That being said, I still think it's worth the benefit for the
> stability of task placement and a better control of the power.
>
>
> Patch 5 solves another problem with tasks being stuck on a CPU forever
> because they don't sleep anymore and as a result never wake up and call
> feec(). Such a task can be detected by comparing util_avg or runnable_avg
> with the compute capacity of the CPU. Once detected, we can call feec() to
> check if there is a better CPU for the stuck task. The call can be done in
> 2 places:
> - When the task is put back in the runnable list after its running slice,
> with the balance callback mechanism, similarly to the rt/dl push callback.
> - During the cfs tick when there is only 1 running task stuck on the CPU, in
> which case the balance callback can't be used.
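
As a rough illustration of the detection described for patch 5, the check could look like the sketch below. This is an assumption-laden sketch, not the patch itself: task_looks_stuck() is a made-up name, taking the larger of util_avg and runnable_avg is my reading of "util_avg or runnable_avg", and the plain ">=" comparison without any margin is also an assumption.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: report a task as a "stuck" candidate when its
 * utilization or runnable average has reached the compute capacity of
 * the CPU it runs on, i.e. it no longer sleeps on that CPU.
 * The exact signals and threshold used by the series may differ.
 */
static bool task_looks_stuck(unsigned long util_avg,
			     unsigned long runnable_avg,
			     unsigned long cpu_capacity)
{
	unsigned long demand = util_avg > runnable_avg ? util_avg : runnable_avg;

	return demand >= cpu_capacity;
}
```

A task flagged this way would then be a candidate for a feec() call from one of the two places listed above.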
>
> This push callback doesn't replace the current misfit task mechanism which
> is already implemented, but this could be considered as a follow-up series.
>
>
> This push callback mechanism with the new feec() algorithm ensures that
> tasks always get a chance to migrate to the most suitable CPU and don't
> stay stuck on a CPU which is no longer the most suitable one. As examples:
> - A task waking on a big CPU with a uclamp max preventing it from sleeping
> and waking up can migrate to a smaller CPU once it's more power efficient.
> - The tasks are spread on CPUs in the PD when they target the same OPP.
>
> This series implements some of the topics discussed at OSPM [1]. Other
> topics will be part of another series.
>
> [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
>
> Vincent Guittot (5):
>   sched/fair: Filter false overloaded_group case for EAS
>   energy model: Add a get previous state function
>   sched/fair: Rework feec() to use cost instead of spare capacity
>   sched/fair: Use EAS also when overutilized
>   sched/fair: Add push task callback for EAS
>
>  include/linux/energy_model.h |  18 +
>  kernel/sched/fair.c          | 693 +++++++++++++++++++++++------------
>  kernel/sched/sched.h         |   2 +
>  3 files changed, 488 insertions(+), 225 deletions(-)
>
On second look, I do wonder if this series should be split into
individual patches or mini-series. Some of the ideas, like
overloaded_groups or calling EAS at more locations rather than just
wake-up events, might be easier to review and merge if they are independent.