linux-kernel - Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d690510c-c3c0-4551-bf18-e1b62269c8cc@arm.com>
Date: Thu, 28 Nov 2024 17:24:06 +0000
From: Hongyan Xia <hongyan.xia2@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, dietmar.eggemann@....com,
 rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
 vschneid@...hat.com, lukasz.luba@....com, rafael.j.wysocki@...el.com,
 linux-kernel@...r.kernel.org
Cc: qyousef@...alina.io
Subject: Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases

Hi Vincent,

On 30/08/2024 14:03, Vincent Guittot wrote:
> The current Energy Aware Scheduler has some known limitations which have
> became more and more visible with features like uclamp as an example. This
> serie tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
> 
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent periodic load balancer to select a group_misfit_task CPU
> because group_overloaded has higher priority.
> 
> 
> Patch 2 creates a new EM interface that will be used by Patch 3
> 
> 
> Patch 3 fixes the issue of tasks being stacked on same CPU of a PD whereas
> others might be a better choice. feec() looks for the CPU with the highest
> spare capacity in a PD assuming that it will be the best CPU from a energy
> efficiency PoV because it will require the smallest increase of OPP.
> This is often but not always true, this policy filters some others CPUs
> which would be as efficients because of using the same OPP but with less
> running tasks as an example.
> In fact, we only care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result having the same energy cost. In
> such cases, we can use other metrics to select the best CPU with the same
> energy cost. Patch 3 rework feec() to look 1st for the lowest cost in a PD
> and then the most performant CPU between CPUs.
> 
> perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> of the new feec() vs current implementation.
> sidenote: delayed dequeue has been disable for all tests.
> 
> 9 iterations of perf bench sched pipe -T -l 80000
>                  ops/sec  stdev
> tip/sched/core  13490    (+/- 1.7%)
> + patches 1-3   14095    (+/- 1.7%)  +4.5%
> 
> 
> When overutilized, the scheduler stops looking for an energy efficient CPU
> and fallback to the default performance mode. Although this is the best
> choice when a system is fully overutilized, it also breaks the energy
> efficiency when one CPU becomes overutilized for a short time because of
> kworker and/or background activity as an example.
> Patch 4 calls feec() everytime instead of skipping it when overutlized,
> and fallback to default performance mode only when feec() can't find a
> suitable CPU. The main advantage is that the task placement remains more
> stable especially when there is a short and transient overutilized state.
> The drawback is that the overhead can be significant for some CPU intensive
> use cases.
> 
> The overhead of patch 4 has been stressed with hackbench on dragonboard rb5
> 
>                                 tip/sched/core        + patches 1-4
> 			       Time    stdev         Time    stdev
> hackbench -l 5120 -g 1         0.724   +/-1.3%       0.765   +/-3.0% (-5.7%)
> hackbench -l 1280 -g 4         0.740   +/-1.1%       0.768   +/-1.8% (-3.8%)
> hackbench -l 640  -g 8         0.792   +/-1.3%       0.812   +/-1.6% (-2.6%)
> hackbench -l 320  -g 16        0.847   +/-1.4%       0.852   +/-1.8% (-0.6%)
> 
> hackbench -p -l 5120 -g 1      0.878   +/-1.9%       1.115   +/-3.0% (-27%)
> hackbench -p -l 1280 -g 4      0.789   +/-2.6%       0.862   +/-5.0% (-9.2%)
> hackbench -p -l 640  -g 8      0.732   +/-1.9%       0.801   +/-4.3% (-9.4%)
> hackbench -p -l 320  -g 16     0.710   +/-4.7%       0.767   +/-4.9% (-8.1%)
> 
> hackbench -T -l 5120 -g 1      0.756   +/-3.9%       0.772   +/-1.63 (-2.0%)
> hackbench -T -l 1280 -g 4      0.725   +/-1.4%       0.737   +/-2.0% (-1.3%)
> hackbench -T -l 640  -g 8      0.767   +/-1.5%       0.809   +/-2.6% (-5.5%)
> hackbench -T -l 320  -g 16     0.812   +/-1.2%       0.823   +/-2.2% (-1.4%)
> 
> hackbench -T -p -l 5120 -g 1   0.941   +/-2.5%       1.190   +/-1.6% (-26%)
> hackbench -T -p -l 1280 -g 4   0.869   +/-2.5%       0.931   +/-4.9% (-7.2%)
> hackbench -T -p -l 640  -g 8   0.819   +/-2.4%       0.895   +/-4.6% (-9.3%)
> hackbench -T -p -l 320  -g 16  0.763   +/-2.6%       0.863   +/-5.0% (-13%)
> 
> Side note: Both new feec() and current feec() give similar overheads with
> patch 4.
> 
> Although the highest reachable CPU throughput is not the only target of EAS,
> the overhead can be significant in some cases as shown in hackbech results
> above. That being said I still think it's worth the benefit for the stability
> of tasks placement and a better control of the power.
> 
> 
> Patch 5 solves another problem with tasks being stuck on a CPU forever
> because it doesn't sleep anymore and as a result never wakeup and call
> feec(). Such task can be detected by comparing util_avg or runnable_avg
> with the compute capacity of the CPU. Once detected, we can call feec() to
> check if there is a better CPU for the stuck task. The call can be done in
> 2 places:
> - When the task is put back in the runnnable list after its running slice
>    with the balance callback mecanism similarly to the rt/dl push callback.
> - During cfs tick when there is only 1 running task stuck on the CPU in
>    which case the balance callback can't be used.
> 
> This push callback doesn't replace the current misfit task mecanism which
> is already implemented but this could be considered as a follow up serie.
> 
> 
> This push callback mecanism with the new feec() algorithm ensures that
> tasks always get a chance to migrate on the best suitable CPU and don't
> stay stuck on a CPU which is no more the most suitable one. As examples:
> - A task waking on a big CPU with a uclamp max preventing it to sleep and
>    wake up, can migrate on a smaller CPU once it's more power efficient.
> - The tasks are spread on CPUs in the PD when they target the same OPP.
> 
> This series implements some of the topics discussed at OSPM [1]. Other
> topics will be part of an other serie
> 
> [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
> 
> Vincent Guittot (5):
>    sched/fair: Filter false overloaded_group case for EAS
>    energy model: Add a get previous state function
>    sched/fair: Rework feec() to use cost instead of spare capacity
>    sched/fair: Use EAS also when overutilized
>    sched/fair: Add push task callback for EAS
> 
>   include/linux/energy_model.h |  18 +
>   kernel/sched/fair.c          | 693 +++++++++++++++++++++++------------
>   kernel/sched/sched.h         |   2 +
>   3 files changed, 488 insertions(+), 225 deletions(-)
> 

On second look, I do wonder if this series should be split into 
individual patches or mini-series. Some of the ideas, like 
overloaded_groups or calling EAS at more locations rather than just 
wake-up events, might be easier to review and merge if they are independent.