Message-ID: <c24fe564-e650-4f39-88b0-43399398b61f@arm.com>
Date: Fri, 9 Jan 2026 18:12:18 +0100
From: Pierre Gondois <pierre.gondois@....com>
To: Christian Loehle <christian.loehle@....com>, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org, vincent.guittot@...aro.org,
dietmar.eggemann@....com
Cc: rafael@...nel.org, qyousef@...alina.io, peterz@...radead.org,
qperret@...gle.com, sven@...npeter.dev
Subject: Re: [PATCH 0/1] sched: Ignore overutilized by lone task on max-cap
CPU
On 12/30/25 10:30, Christian Loehle wrote:
> I'm trying to deliver on my overdue promise of redefining the overutilized state.
> My investigation basically led to the conclusion that redefining the
> overutilized state brings very little hard improvement, while carrying
> at least some risk of regressing on platform and workload combinations
> I might have overlooked. Therefore I concentrate on only one change,
> the least controversial, for now.
> When a task is alone on a max-capacity CPU there's no reason to let it
> trigger OU, because it will only ever be placed on another max-capacity
> CPU. We therefore skip setting overutilized in this scenario, but
> carefully: OU can still trigger if any other task is present, or if the
> capacity is (usually temporarily) reduced because of system or thermal
> pressure.
> On platforms common in phones this strategy didn't prove useful: even
> a single such task already consumes the majority of the phone's
> thermal (or even power) budget, so such a situation is not very
> stable, and continuing to attempt EAS on the other CPUs seemed
> unnecessary.
> OTOH there are more and more systems (e.g. Apple silicon,
> Radxa Orion O6, x86 hybrids) where such a situation can be sustained,
> and which also have many more max-capacity CPUs, so more possibilities
> for the patch to trigger.
>
> For further information and the OSPM discussion see:
> https://www.youtube.com/watch?v=N0tZ8GhhQzc
>
> Radxa orion o6 (capacities: 1024, 279, 279, 279, 279, 905, 905, 866, 866, 984, 984, 1024):
> Mean of 10 Geekbench6.3 iterations (all values are the mean)
> +------------+--------+---------+-------+--------------+
> | Test | patch | score | OU % | OU triggers |
> +------------+--------+---------+-------+--------------+
> | GB6 Single | patch | 1182.4 | 26.14 | 1942.4 |
> | GB6 Single | base | 1186.9 | 71.23 | 573.0 |
> +------------+--------+---------+-------+--------------+
> | GB6 Multi | patch | 5227.7 | 44.11 | 984.5 |
> | GB6 Multi | base | 5395.6 | 53.17 | 773.1 |
> +------------+--------+---------+-------+--------------+
> (OU triggers are overutilized rd 0->1 transitions)
Not really important, but having more/fewer OU transitions
should not be a criterion by itself, right?
If the goal is to use EAS as much as possible, it would be
better to compare the number of task placement decisions
that go through EAS between the two versions.
(I think the numbers are convincing enough,
this is just for discussion.)
> GB6 Multi score stdev is 43 for base.
>
> RK3399 ((384, 384, 384, 384)(1024, 1024))
> stress-ng --cpu X --timeout 60s
> Mean of 10 iterations
> +-----------+--------+------+--------------+
> | stress-ng | patch | OU % | OU triggers |
> +-----------+--------+------+--------------+
> | 1x | patch | 0.01 | 10.5 |
> | 1x | base | 99.7 | 4.4 |
> +-----------+--------+------+--------------+
> | 2x | patch | 0.01 | 13.8 |
> | 2x | base | 99.7 | 5.3 |
> +-----------+--------+------+--------------+
> | 3x | patch | 99.8 | 4.1 |
> | 3x | base | 99.8 | 4.6 |
> +-----------+--------+------+--------------+
> (System only has 2 1024-capacity CPUs, so for 3x stress-ng
> patch and base are intended to behave the same.)
>
> M1 Pro ((485, 485) (1024, 1024, 1024) (1024, 1024, 1024))
> (backported to the 6.17-based asahi kernel)
> +-----------+--------+-------+--------------+
> | stress-ng | patch | OU % | OU triggers |
> +-----------+--------+-------+--------------+
> | 1x | patch | 8.26 | 432.0 |
> | 1x | base | 99.14 | 4.2 |
> +-----------+--------+-------+--------------+
> | 2x | patch | 8.79 | 470.2 |
> | 2x | base | 99.21 | 3.8 |
> +-----------+--------+-------+--------------+
> | 4x | patch | 8.99 | 475.2 |
> | 4x | base | 99.17 | 4.6 |
> +-----------+--------+-------+--------------+
> | 6x | patch | 8.81 | 478.8 |
> | 6x | base | 99.14 | 5.0 |
> +-----------+--------+-------+--------------+
> | 7x | patch | 99.21 | 4.0 |
> | 7x | base | 99.27 | 4.2 |
> +-----------+--------+-------+--------------+
>
> Mean of 20 Geekbench 6.3 iterations
> +------------+--------+---------+-------+--------------+
> | Test | patch | score | OU % | OU triggers |
> +------------+--------+---------+-------+--------------+
> | GB6 Single | patch | 2296.9 | 3.99 | 669.4 |
> | GB6 Single | base | 2295.8 | 50.06 | 28.4 |
> +------------+--------+---------+-------+--------------+
> | GB6 Multi | patch | 10621.8 | 18.77 | 636.4 |
> | GB6 Multi | base | 10686.8 | 28.72 | 66.8 |
> +------------+--------+---------+-------+--------------+
>
> Energy numbers are trace-based (lisa.estimate_from_trace()):
> GB6 Single -12.63% energy average (equal score)
> GB6 Multi +1.76% energy average (for equal score runs)
Just to repeat some things you said in another thread:
-
For GB6 Multi, a slightly lower score is to be expected, as CAS
gives better scores in general and EAS runs longer with your patch.
It is however unfortunate to also get slightly higher energy
consumption.
-
The focus should be on GB6 Single, where the energy saving is
greatly improved.
>
> No changes observed with Geekbench 6 on a Pixel 6 running a 6.12-based kernel with the patch backported.
>
> Functional test:
> Using the above described M1 Pro I created an rt-app workload [1]:
> Workload:
> - tskbusy: periodic 100% duty, period 1s, duration 10s (single always-running task)
> - tsk_{a..d}: periodic 5% duty, 16ms period, duration 10s (four small periodic tasks)
> Target system: 8 CPUs (0-7), 2 little (cpu0 & cpu1), 6 big
> Metric: per-task CPU residency (seconds) over the 10s run
> OU metric: time spent in overutilized state / total time; Number of
> OU 0->1 transitions (triggers).
>
> Case A Mainline:
> Small task CPU residency (s), 10s run
> task cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 total
> tsk_a 0.124 0.000 0.000 0.000 0.035 1.791 0.492 0.001 2.444
> tsk_b 0.002 0.000 0.500 0.000 0.000 0.001 0.004 0.000 0.507
> tsk_c 0.000 0.000 0.000 0.000 0.001 0.000 1.895 0.630 2.526
> tsk_d 0.000 0.389 0.001 0.000 0.450 0.000 0.000 0.000 0.840
>
> (Little CPUs 0 & 1 rarely get picked for the small tasks due to CAS
> task placement, which isn't deterministically "always pick big CPUs",
> but since the big CPUs make up 6 of the 8 this is the common case.)
>
> Overutilized:
> - OU time = 10.0s / 11.0s (ratio 0.909)
> - OU triggers = 7
>
> Case B Patch:
> Small task CPU residency (s), 10s run
> task cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 total
> tsk_a 0.055 1.907 0.006 0.012 0.002 0.001 0.000 0.005 1.987
> tsk_b 1.845 0.115 0.014 0.000 0.004 0.002 0.000 0.000 1.981
> tsk_c 0.914 1.069 0.007 0.000 0.004 0.005 0.000 0.000 1.999
> tsk_d 1.000 0.985 0.004 0.005 0.000 0.000 0.000 0.000 1.995
>
> Overutilized:
> - OU time = 0.1s / 11.2s (ratio 0.007)
> - OU triggers = 57
>
> (Little CPUs 0 & 1 get picked by the vast majority of wakeups and aren't migrated
> to the big CPUs.)
>
>
> [1]
> LISA's RTApp workload generation description:
>
> rtapp_profile = {
>     'tskbusy': RTAPhase(
>         prop_wload=PeriodicWload(
>             duty_cycle_pct=100,
>             period=1,
>             duration=10,
>         )
>     ),
>     'tsk_a': RTAPhase(
>         prop_wload=PeriodicWload(
>             duty_cycle_pct=5,
>             period=16e-3,
>             duration=10,
>         )
>     ),
>     'tsk_b': RTAPhase(
>         prop_wload=PeriodicWload(
>             duty_cycle_pct=5,
>             period=16e-3,
>             duration=10,
>         )
>     ),
>     'tsk_c': RTAPhase(
>         prop_wload=PeriodicWload(
>             duty_cycle_pct=5,
>             period=16e-3,
>             duration=10,
>         )
>     ),
>     'tsk_d': RTAPhase(
>         prop_wload=PeriodicWload(
>             duty_cycle_pct=5,
>             period=16e-3,
>             duration=10,
>         )
>     ),
> }
>
> Christian Loehle (1):
> sched/fair: Ignore OU for lone task on max-cap CPU
>
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>