Message-ID: <c86d2b0e-9640-46ff-b069-1ffa1805117a@arm.com>
Date: Wed, 16 Apr 2025 11:51:29 +0100
From: Christian Loehle <christian.loehle@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, lukasz.luba@....com,
 rafael.j.wysocki@...el.com, pierre.gondois@....com,
 linux-kernel@...r.kernel.org, qyousef@...alina.io, hongyan.xia2@....com,
 luis.machado@....com, qperret@...gle.com
Subject: Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases

On 4/15/25 14:49, Vincent Guittot wrote:
> Hi Christian,
> 
> On Thu, 3 Apr 2025 at 14:37, Christian Loehle <christian.loehle@....com> wrote:
>>
>> On 3/2/25 21:05, Vincent Guittot wrote:
>>> The current Energy Aware Scheduler has some known limitations which have
>>> become more and more visible with features like uclamp, for example. This
>>> series tries to fix some of those issues:
>>> - tasks stacked on the same CPU of a PD
>>> - tasks stuck on the wrong CPU.
>>>
> 
> ...
> 
>>>
>>>  include/linux/energy_model.h | 111 ++----
>>>  kernel/sched/fair.c          | 721 ++++++++++++++++++++++++-----------
>>>  kernel/sched/sched.h         |   2 +
>>>  3 files changed, 518 insertions(+), 316 deletions(-)
>>>
>>
>> Hi Vincent,
>> so I've invested some time into running tests with the series.
>> To further narrow down which patch a change in behavior can be
>> attributed to, I've compared the following:
>> - Patches 1 to 3 applied, comparing only your proposed feec() (B)
>> to the baseline feec() (A).
>> - All patches applied, using a static branch to enable (C) and
>> disable (D) the push mechanism for misfit tasks (if disabled, only
>> the 'tasks stuck on CPU' mechanism triggers here).
>>
>> I've looked at
>> 1) YouTube 4K video playback
>> 2) Dr.Arm (in-house ARM game)
>> 3) VideoScroller which loads a new video every 3s
>> 4) Idle screen on
>> 5) Speedometer2.0 in Chromium
>>
>> The device tested is the Pixel6 with 6.12 kernel + backported
>> scheduler patches.
> 
> What do you mean by "6.12 kernel + backported scheduler patches"? Do
> you mean android mainline v6.12?

Yes, in particular with the following patches backported
(listed here in shortlog form):
PM: EM: Add min/max available performance state limits  
sched/fair: Fix variable declaration position  
sched/fair: Do not try to migrate delayed dequeue task  
sched/fair: Rename cfs_rq.nr_running into nr_queued  
sched/fair: Remove unused cfs_rq.idle_nr_running  
sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle  
sched/fair: Removed unsued cfs_rq.h_nr_delayed  
sched/fair: Use the new cfs_rq.h_nr_runnable  
sched/fair: Add new cfs_rq.h_nr_runnable  
sched/fair: Rename h_nr_running into h_nr_queued  
sched/eevdf: More PELT vs DELAYED_DEQUEUE  
sched/fair: Fix sched_can_stop_tick() for fair tasks  
sched/fair: optimize the PLACE_LAG when se->vlag is zero  

> 
> I ran my tests with android mainline v6.13 + scheduler patches for
> v6.14 and v6.15-rc1. Do you mean the same? v6.12 misses a number of
> important patches with regard to thread accounting

Which ones in particular do you think are critical?
I'm also happy to just use your branch for testing, so we align on
a common base, if you're willing to share it.
I'm not happy about having to test on backported kernels either, but
as long as this is necessary we might as well just share branches of
Android mainline kernel backports for EAS patches; we all do the
backports anyway.

> 
>> For power measurements the onboard energy-meter is used [1].
> 
> same for me
> 
>>
>> Mainline feec() A is the baseline for all. All workloads are run for
>> 10mins with the exception of Speedometer 2.0
>> (one iteration each for 5 iterations with cooldowns).
> 
> What do you mean exactly by (one iteration each for 5 iterations with
> cooldowns) ?

So for Speedometer 2.0 I do the following:
Run one iteration.
Wait until the device has cooled down (all temp sensors <30C).
Repeat 5x.
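
If it helps reproduction, the cooldown gate between iterations is
essentially the following (a minimal sketch, assuming the standard
sysfs thermal interface in millidegrees Celsius; NR_ZONES is a
placeholder for the device's actual zone count):

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NR_ZONES  8       /* placeholder: thermal zones on the device */
#define COOL_MDEG 30000   /* 30C threshold, in millidegrees */

static bool all_zones_cool(void)
{
        char path[64];

        for (int i = 0; i < NR_ZONES; i++) {
                FILE *f;
                int t;

                snprintf(path, sizeof(path),
                         "/sys/class/thermal/thermal_zone%d/temp", i);
                f = fopen(path, "r");
                if (!f)
                        continue;          /* zone absent: skip it */
                if (fscanf(f, "%d", &t) != 1)
                        t = 0;             /* unreadable: treat as cool */
                fclose(f);
                if (t >= COOL_MDEG)
                        return false;
        }
        return true;
}

int main(void)
{
        while (!all_zones_cool())
                sleep(10);                 /* poll until every sensor <30C */
        return 0;
}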

> 
>>
>> 1) YouTube 4K video
> 
> I'd like to reproduce this use case because my test with 4k video
> playback shows similar or slightly better power consumption (2%) with
> this patch.
> 
> Do you have details about this use case that you can share ?

Sure, in that case it's just a 5-hour-long sample video without
ads in between. I then static-branch between e.g. the two feec()s
to collect the numbers, with one minute of stabilising between
static-branch switches where energy numbers are disregarded.
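
The toggle itself is just the usual static key pattern, roughly like
the sketch below (the key name and the two feec() variants are made-up
placeholders, not names from the series):

#include <linux/jump_label.h>
#include <linux/sched.h>

static DEFINE_STATIC_KEY_FALSE(sched_feec_rework);

/* Placeholders for the two placement paths under comparison. */
extern int feec_baseline(struct task_struct *p, int prev_cpu);
extern int feec_reworked(struct task_struct *p, int prev_cpu);

static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
        if (static_branch_unlikely(&sched_feec_rework))
                return feec_reworked(p, prev_cpu);  /* proposed path (B) */

        return feec_baseline(p, prev_cpu);          /* baseline path (A) */
}

/*
 * Flipped at runtime (e.g. from a debugfs write) via:
 *   static_branch_enable(&sched_feec_rework);
 *   static_branch_disable(&sched_feec_rework);
 */

The nice part is that the disabled path is patched to a nop, so the
comparison isn't skewed by the toggle itself.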

> 
> 
>> +4.5% power for all other tested configurations (the regression already
>> shows with B, no further change with C & D).
>> (cf. +18.5% power with CAS).
>> The power regression comes from increased average frequency on all
>> 3 clusters.
> 
> I'm interested to understand why the average frequency increases, as
> the OPP remains the 1st level of selection, and in lightly loaded
> use cases we should not see much difference. That's what I see on my
> 4k video playback use case

Well, the OPPs may be quite far apart, and while the max-spare-cap
strategy will optimally balance the util within the cluster, this
series deviates from that, so you will raise the OPP earlier as the
util of the CPUs in the cluster grows.
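
To spell out what I mean by that: schematically, the baseline picks
the CPU with the most spare capacity within the performance domain
(a simplified sketch of the idea, not the actual feec() code):

struct cpu_stat {
        unsigned long capacity;  /* usable capacity of the CPU */
        unsigned long util;      /* current utilization estimate */
};

/* Pick the CPU with the largest (capacity - util) in the domain. */
static int max_spare_cap_cpu(const struct cpu_stat *cpus, int nr)
{
        long best_spare = -1;
        int best = -1;

        for (int i = 0; i < nr; i++) {
                long spare = (long)cpus[i].capacity - (long)cpus[i].util;

                if (spare > best_spare) {
                        best_spare = spare;
                        best = i;
                }
        }
        return best;  /* keeps per-CPU util balanced, shared OPP low */
}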
For illustration here's the OPP table for the tested Pixel 6:
CPU    Freq (kHz)    ΔFreq   Capacity   ΔCap 
cpu0   300000        0        26         0     
cpu0   574000        274000   50         24    
cpu0   738000        164000   65         15    
cpu0   930000        192000   82         17    
cpu0   1098000       168000   97         15    
cpu0   1197000       99000    106        9     
cpu0   1328000       131000   117        11    
cpu0   1401000       73000    124        7     
cpu0   1598000       197000   141        17    
cpu0   1704000       106000   151        10    
cpu0   1803000       99000    160        9  
cpu4   400000        0        88         0     
cpu4   553000        153000   122        34    
cpu4   696000        143000   153        31    
cpu4   799000        103000   176        23    
cpu4   910000        111000   201        25    
cpu4   1024000       114000   226        25    
cpu4   1197000       173000   264        38    
cpu4   1328000       131000   293        29    
cpu4   1491000       163000   329        36    
cpu4   1663000       172000   367        38    
cpu4   1836000       173000   405        38    
cpu4   1999000       163000   441        36    
cpu4   2130000       131000   470        29    
cpu4   2253000       123000   498        28   
cpu6   500000        0        182        0     
cpu6   851000        351000   311        129   
cpu6   984000        133000   359        48    
cpu6   1106000       122000   404        45    
cpu6   1277000       171000   466        62    
cpu6   1426000       149000   521        55    
cpu6   1582000       156000   578        57    
cpu6   1745000       163000   637        59    
cpu6   1826000       81000    667        30    
cpu6   2048000       222000   748        81    
cpu6   2188000       140000   799        51    
cpu6   2252000       64000    823        24    
cpu6   2401000       149000   877        54    
cpu6   2507000       106000   916        39    
cpu6   2630000       123000   961        45    
cpu6   2704000       74000    988        27    
cpu6   2802000       98000    1024       36
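
(As a sanity check, the Capacity column can be reproduced by linearly
scaling each cluster's max capacity by frequency, cap(f) = cap_max * f
/ f_max, with cap_max taken from each cluster's last row:)

#include <stdio.h>

static unsigned long cap(unsigned long f, unsigned long f_max,
                         unsigned long cap_max)
{
        return cap_max * f / f_max;  /* linear freq-to-capacity scaling */
}

int main(void)
{
        printf("cpu0 @  300MHz: %lu\n", cap(300000, 1803000, 160));   /* 26 */
        printf("cpu4 @ 1197MHz: %lu\n", cap(1197000, 2253000, 498));  /* 264 */
        printf("cpu6 @  851MHz: %lu\n", cap(851000, 2802000, 1024));  /* 311 */
        return 0;
}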

A hypothetical util distribution on the littles at OPP0
would be:
CPU0:5 CPU1:16 CPU2:17 CPU3:18
When now placing a util=2 task, max-spare-cap will obviously
pick CPU0, while you may deviate from that, also picking any
of CPU1-3. For CPU3, even a single further util increase will
raise the OPP of the cluster.
As util is never that stable, the balancing effect of
max-spare-cap helps preserve energy.
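
To put numbers on that, assuming a schedutil-style 1.25x headroom
(a sketch of the idea, not the exact governor code: an OPP fits as
long as util * 1.25 <= its capacity), the littles' OPP0 capacity of
26 gives:

#include <stdio.h>

/* Little-cluster capacities per OPP, from the table above. */
static const unsigned int little_caps[] = { 26, 50, 65, 82, 97, 106 };

/* Lowest OPP whose capacity covers util with 1.25x headroom. */
static int pick_opp(unsigned int max_util)
{
        int n = sizeof(little_caps) / sizeof(little_caps[0]);

        for (int i = 0; i < n; i++)
                if (max_util * 5 <= little_caps[i] * 4)  /* util*1.25 <= cap */
                        return i;
        return -1;  /* beyond the listed OPPs */
}

int main(void)
{
        printf("max util 18 -> OPP%d\n", pick_opp(18));  /* OPP0: 22.5 <= 26 */
        printf("max util 20 -> OPP%d\n", pick_opp(20));  /* OPP0, at the edge */
        printf("max util 21 -> OPP%d\n", pick_opp(21));  /* OPP1: 26.25 > 26 */
        return 0;
}

So placing the task on CPU0 leaves plenty of headroom at OPP0, while
placing it on CPU3 sits right at the edge.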

On the big cluster (CPU6), the OPP0 -> OPP1 step makes the situation
even worse, in case the util numbers above are too small to be
convincing.

> 
> And I will also look at why the CAS is better in your case
> 
>> No dropped frames in any of A to D.
>>
>> 2) Dr.Arm (in-house ARM game)
>> +9.9% power for all other tested configurations (the regression already
>> shows with B, no further change with C & D).
>> (cf. +3.7% power with CAS; the new feec() performs worse than CAS here.)
>> The power regression comes from increased average frequency on all
>> 3 clusters.
> 
> I suppose that I won't be able to reproduce this one

Not really, although given that the YT case is similar I don't
think this would be a one-off. Probably any comparable 3D action
game will do (ours is just really easy to automate internally,
obviously).

> 
>>
>> 3) VideoScroller
>> No difference in terms of power for A to D.
>> Specifically, even the push mechanism with misfit enabled/disabled
>> doesn't make a noticeable difference in per-cluster energy numbers.
>>
>> 4) Idle screen on
>> No difference in power for A to D.
> 
> I see a difference here, mainly for DDR power consumption, with a 7%
> saving compared to mainline and 2% on the CPU clusters

Honestly, the stddev on these is so high that something needs to go
quite badly wrong to show up as significant here; I just wanted to
include it.

> 
>>
>> 5) Speedometer2.0 in Chromium
>> Both power and score comparable for A to D.
>>
>> As mentioned in the thread already the push mechanism
>> (without misfit tasks) (D) triggers only once every 2-20 minutes,
>> depending on the workload (all tested here were without any
>> UCLAMP_MAX tasks).
>> I also used the device manually just to check that I'm not missing
>> anything here; I wasn't.
>> This push task mechanism shouldn't make any difference without
>> UCLAMP_MAX.
> 
> On the push mechanism side, I'm surprised that you don't get more push
> than once every 2-20 minutes. On the speedometer, I've got around 170
> push fair and 600 check pushable that end with a task migration
> during the 75 seconds of the test, and many more calls that end with
> the same cpu. This also needs to be compared with the 70% of
> overutilized state during the 75 seconds, during which we
> don't push. In lightly loaded cases, the condition is currently too
> conservative to trigger the push task mechanism, but that's also
> expected in order to be conservative

Does that include misfit pushes? I'd be interested if our results
vastly differ here. Just to reiterate, this is without misfit pushes,
only the "stuck on CPU" case introduced by 5/7.

> 
> The fact that OU triggers too quickly limits the impact of push and feec rework

I'm working on a series here :)

> 
> uclamp_max sees a difference with the push mechanism, which is another
> argument for using it.

I don't doubt that, but there's little to test with real-world use-cases
really...

> 
> And this 1st step is quite conservative, before extending the cases
> which can benefit from the push and feec rework, as explained at OSPM
> 

Right, I actually do see the appeal of having the push mechanism in
fair/EAS, but of course the series introducing it should also show
sufficiently convincing benefits.

