Message-ID: <ebab52de-4e99-4e63-b1e2-e676d854e4be@arm.com>
Date: Fri, 20 Dec 2024 15:48:09 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>,
Pierre Gondois <pierre.gondois@....com>
Cc: linux-kernel@...r.kernel.org, Christian Loehle <christian.loehle@....com>,
Hongyan Xia <hongyan.xia2@....com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] sched/fair: Decrease util_est in presence of idle time
On 20/12/2024 08:47, Vincent Guittot wrote:
> On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
> <vincent.guittot@...aro.org> wrote:
>>
>> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@....com> wrote:
>>>
>>> The util_est signal does not decay if the task utilization is lower
>>> than its runnable signal by a value of 10. This was done to keep
>>
>> The value of 10 is UTIL_EST_MARGIN, which is used to decide whether
>> it's worth updating util_est.
Might it be that UTIL_EST_MARGIN is just too small for this use case?
Maybe the mechanism is too sensitive?
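For reference, the gate in question in mainline kernel/sched/fair.c
looks roughly like this (excerpt, trimmed by me):

  #define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)	/* i.e. 10 */

  /* in util_est_update(), called at dequeue: */
  /*
   * To avoid underestimate of task utilization, skip updates of EWMA if
   * we cannot grant that thread got all CPU time it wanted.
   */
  if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
          goto done;	/* don't decay util_est's EWMA */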
It already triggers when running ten 5% tasks on a Juno-r0 (CPU
capacities: 446 1024 1024 446 446 446) in cases where 2 tasks are
scheduled on the same little CPU:
...
task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
...
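In each of these samples 'dequeued + UTIL_EST_MARGIN < rbl' holds (e.g.
17 + 10 < 40), so the EWMA update is skipped and util_est stays high,
even though 2 such tasks come nowhere near saturating the little CPU.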
I'm not sure whether the original case (Speedometer on Pix6?) which led
to this implementation was evaluated with perf/energy numbers back then?
>>> the util_est signal high in case a task shares a rq with another
>>> task and doesn't obtain its desired running time.
>>>
>>> However, tasks sharing a rq obtain the running time they desire
>>> provided that the rq has some idle time. Indeed, either:
>>> - a CPU is always running. The utilization signal of tasks reflects
>>> the running time they obtained. This running time depends on the
>>> niceness of the tasks. A decreasing utilization signal doesn't
>>> reflect a decrease of the task activity and the util_est signal
>>> should not be decayed in this case.
>>> - a CPU is not always running (i.e. there is some idle time). Tasks
>>> might be waiting to run, increasing their runnable signal, but
>>> eventually run to completion. A decreasing utilization signal
>>> does reflect a decrease of the task activity and the util_est
>>> signal should be decayed in this case.
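IOW, the policy argued for above boils down to something like this
(my pseudo-C sketch of the intent; the helper names are made up, this
is not the posted code):

  if (rq_has_idle_time(rq)) {
          /*
           * Tasks eventually got all the CPU time they wanted, so a
           * shrinking util_avg means less activity: let util_est decay.
           */
          decay_util_est(p);
  } else {
          /*
           * CPU always running: util_avg shrinks due to contention,
           * not due to less activity: keep util_est high.
           */
          keep_util_est(p);
  }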
>>
>> This is not always true.
>> Run a task for 40ms with a period of 100ms alone on the biggest CPU
>> at max compute capacity. Its util_avg is up to 674 at dequeue, as is
>> its util_est.
>> Then start a 2nd task with the exact same behavior on the same CPU.
>> The util_avg of this 2nd task will be only 496 at dequeue, as will
>> its util_est, but there is still 20ms of idle time. Furthermore, the
>> util_avg of the 1st task is also around 496 at dequeue but
>
> the end of the sentence was missing...
>
> but there is still 20ms of idle time.
But these two tasks (2 * 40ms = 80ms of busy time per 100ms window,
leaving 20ms idle) are still able to finish their activity within this
100ms window. So why should we keep their util_est values high when
dequeuing?
[...]
>>> The initial patch [2] aimed to solve an issue detected while running
>>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
>>> versions are compared:
>>> - base: the current version
>>
>> What do you mean by current version? tip/sched/core?
>>
>>> - patch: the new version, with this patch applied
>>> - revert: the initial version, with commit [2] reverted
>>>
>>> Score (higher is better):
>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>> │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>> ┌───────────┬───────────┬────────────┐
>>> │ base std ┆ patch std ┆ revert std │
>>> ╞═══════════╪═══════════╪════════════╡
>>> │ 0.57 ┆ 0.49 ┆ 0.58 │
>>> └───────────┴───────────┴────────────┘
>>>
>>> Energy measured with energy counters:
>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>> │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>> ┌───────────┬───────────┬────────────┐
>>> │ base std ┆ patch std ┆ revert std │
>>> ╞═══════════╪═══════════╪════════════╡
>>> │ 1347.13 ┆ 2431.67 ┆ 510.88 │
>>> └───────────┴───────────┴────────────┘
>>>
>>> Energy computed from util signals and energy model:
>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>> │ 2.0539e12  ┆ 1.3569e12  ┆ 1.3637e12  ┆ -33.93%     ┆ -33.60%      │
>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>> ┌───────────┬───────────┬────────────┐
>>> │ base std ┆ patch std ┆ revert std │
>>> ╞═══════════╪═══════════╪════════════╡
>>> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e10  │
>>> └───────────┴───────────┴────────────┘
>>>
>>> OU ratio in % (ratio of time spent overutilized to total time).
>>> The test lasts ~65s:
>>> ┌────────────┬────────────┬─────────────┐
>>> │ base mean ┆ patch mean ┆ revert mean │
>>> ╞════════════╪════════════╪═════════════╡
>>> │ 63.39% ┆ 12.48% ┆ 12.28% │
>>> └────────────┴────────────┴─────────────┘
>>> ┌───────────┬───────────┬─────────────┐
>>> │ base std  ┆ patch std ┆ revert std  │
>>> ╞═══════════╪═══════════╪═════════════╡
>>> │ 0.97 ┆ 0.28 ┆ 0.88 │
>>> └───────────┴───────────┴─────────────┘
>>>
>>> The energy gain can be explained by the fact that the system is
>>> overutilized during most of the test with the base version.
>>>
>>> During the test, the base condition evaluates to true ~40% of the
>>> time, while the new condition evaluates to true only ~2% of the
>>> time. Preventing util_est signals from decaying with the base
>>> condition has a significant impact on the overutilized state, due
>>> to an overestimation of the resulting utilization of tasks.
>>>
>>> The score is impacted by the patch, but:
>>> - slightly lower scores are expected with EAS running more often
>>> - since the base version makes the system run at higher frequencies
>>>   by overestimating task utilization, it is expected to achieve
>>>   higher scores
>>
>> I'm not sure I get what you are trying to solve here?
Yeah, the question is how much perf loss we are willing to accept for
energy savings? IMHO, that's impossible to answer generically based on
one specific workload/platform incarnation.
[...]
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 3e9ca38512de..d058ab29e52e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>>> * To avoid underestimate of task utilization, skip updates of EWMA if
>>> * we cannot grant that thread got all CPU time it wanted.
>>> */
>>> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
>>> + if (rq_no_idle_pelt(rq_of(cfs_rq)))
>>
>> You can't use here the test that is done in
>> update_idle_rq_clock_pelt() to detect whether we lost some idle time,
>> because this test is only relevant when the rq becomes idle, which is
>> not the case here.
Do you mean this test?

  util_avg = util_sum / divider
  util_sum >= divider * util_avg

with 'divider = LOAD_AVG_MAX - 1024', 'util_avg = 1024 - 1' and the
upper bound of the window (+ 1024):

  util_sum >= ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX

Why can't we use it here?
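FWIW, assuming rq_no_idle_pelt() just mirrors that condition from
update_idle_rq_clock_pelt() (my reconstruction; the helper's body isn't
quoted in this thread), it would look something like:

  /*
   * Reconstruction, pelt.h style: the rq is considered always-running
   * (no idle time from PELT's point of view) once its util_sum
   * saturates.
   */
  static inline bool rq_no_idle_pelt(struct rq *rq)
  {
          u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) -
                        LOAD_AVG_MAX;
          u32 util_sum = rq->cfs.avg.util_sum;

          util_sum += rq->avg_rt.util_sum;
          util_sum += rq->avg_dl.util_sum;

          return util_sum >= divider;
  }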
>> With this test you completely skip the cases where the task has to
>> share the CPU with others. As an example, on the Pixel 6 the little
True. But I assume that's intended here. The assumption is that as long
as there is idle time, tasks get what they want within a given time
frame.
>> CPUs must run for more than 1.2 seconds at their max freq before
>> detecting that there is no idle time.
BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160
= 2.07s (with the Pix5 little CPU capacity of 160)?
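Working backwards, 1.2s would correspond to a capacity of roughly
323ms * 1024 / 1200ms ≈ 276 rather than 160, so one of the two numbers
(the 1.2s or the capacity) doesn't fit my calculation.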
[...]