linux-kernel - Re: [PATCH] sched/fair: Decrease util

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtCzz9qTRWkDqQzHyguy=pqnasXVy5smd=PeOqYTdtx1FQ@mail.gmail.com>
Date: Fri, 20 Dec 2024 16:05:29 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Pierre Gondois <pierre.gondois@....com>, linux-kernel@...r.kernel.org, 
	Chritian Loehle <christian.loehle@....com>, Hongyan Xia <hongyan.xia2@....com>, 
	Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] sched/fair: Decrease util_est in presence of idle time

On Fri, 20 Dec 2024 at 15:48, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>
> On 20/12/2024 08:47, Vincent Guittot wrote:
> > On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
> > <vincent.guittot@...aro.org> wrote:
> >>
> >> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@....com> wrote:
> >>>
> >>> util_est signal does not decay if the task utilization is lower
> >>> than its runnable signal by a value of 10. This was done to keep
> >>
> >> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
> >> worth updating util_est
> Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
> the mechanism is too sensitive?

The default config is to follow util_est update

>
> It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
> 446 446 446) in cases 2 tasks are scheduled on the same little CPU:
>
> ...
> task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
> task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
> task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
> task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
> ...
>
> I'm not sure if the original case (Speedometer on Pix6 ?) which lead to
> this implementation was tested with perf/energy numbers back then?
>
> >>> the util_est signal high in case a task shares a rq with another
> >>> task and doesn't obtain a desired running time.
> >>>
> >>> However, tasks sharing a rq obtain the running time they desire
> >>> provided that the rq has some idle time. Indeed, either:
> >>> - a CPU is always running. The utilization signal of tasks reflects
> >>>   the running time they obtained. This running time depends on the
> >>>   niceness of the tasks. A decreasing utilization signal doesn't
> >>>   reflect a decrease of the task activity and the util_est signal
> >>>   should not be decayed in this case.
> >>> - a CPU is not always running (i.e. there is some idle time). Tasks
> >>>   might be waiting to run, increasing their runnable signal, but
> >>>   eventually run to completion. A decreasing utilization signal
> >>>   does reflect a decrease of the task activity and the util_est
> >>>   signal should be decayed in this case.
> >>
> >> This is not always true
> >> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
> >> compute capacity. its util_avg is up to 674 at dequeue as well as its
> >> util_est
> >> Then start a 2nd task with the exact same behavior on the same cpu.
> >> The util_avg of this 2nd task will be only 496 at dequeue as well as
> >> its util_est but there is still 20ms of idle time. Furthermore,  The
> >> util_avg of the 1st task is also around 496 at dequeue but
> >
> > the end of the sentence was missing...
> >
> > but there is still 20ms of idle time.
>
> But these two tasks are still able to finish there activity within this
> 100ms window. So why should we keep their util_est values high when
> dequeuing?

But then, the util_est decreases from the original value compared to
alone whereas its utilization is the same

>
> [...]
>
> >>> The initial patch [2] aimed to solve an issue detected while running
> >>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
> >>> versions are compared:
> >>> - base: the current version
> >>
> >> What do you mean by current version ? tip/sched/core ?
> >>
> >>> - patch: the new version, with this patch applied
> >>> - revert: the initial version, with commit [2] reverted
> >>>
> >>> Score (higher is better):
> >>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>> │ base mean  ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>> │     108.16 ┆     104.06 ┆     105.82 ┆      -3.94% ┆       -2.16% │
> >>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>> ┌───────────┬───────────┬────────────┐
> >>> │ base std  ┆ patch std ┆ revert std │
> >>> ╞═══════════╪═══════════╪════════════╡
> >>> │      0.57 ┆      0.49 ┆       0.58 │
> >>> └───────────┴───────────┴────────────┘
> >>>
> >>> Energy measured with energy counters:
> >>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>> │ base mean  ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>> │  141262.79 ┆  130630.09 ┆  134108.07 ┆      -7.52% ┆       -5.64% │
> >>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>> ┌───────────┬───────────┬────────────┐
> >>> │ base std  ┆ patch std ┆ revert std │
> >>> ╞═══════════╪═══════════╪════════════╡
> >>> │   1347.13 ┆   2431.67 ┆     510.88 │
> >>> └───────────┴───────────┴────────────┘
> >>>
> >>> Energy computed from util signals and energy model:
> >>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>> │ base mean  ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>> │  2.0539e12 ┆  1.3569e12 ┆ 1.3637e+12 ┆     -33.93% ┆      -33.60% │
> >>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>> ┌───────────┬───────────┬────────────┐
> >>> │ base std  ┆ patch std ┆ revert std │
> >>> ╞═══════════╪═══════════╪════════════╡
> >>> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
> >>> └───────────┴───────────┴────────────┘
> >>>
> >>> OU ratio in % (ratio of time being overutilized over total time).
> >>> The test lasts ~65s:
> >>> ┌────────────┬────────────┬─────────────┐
> >>> │ base mean  ┆ patch mean ┆ revert mean │
> >>> ╞════════════╪════════════╪═════════════╡
> >>> │     63.39% ┆     12.48% ┆      12.28% │
> >>> └────────────┴────────────┴─────────────┘
> >>> ┌───────────┬───────────┬─────────────┐
> >>> │ base std  ┆ patch std ┆ revert mean │
> >>> ╞═══════════╪═══════════╪═════════════╡
> >>> │      0.97 ┆      0.28 ┆        0.88 │
> >>> └───────────┴───────────┴─────────────┘
> >>>
> >>> The energy gain can be explained by the fact that the system is
> >>> overutilized during most of the test with the base version.
> >>>
> >>> During the test, the base condition is evaluated to true ~40%
> >>> of the time. The new condition is evaluated to true ~2% of
> >>> the time. Preventing util_est signals to decay with the base
> >>> condition has a significant impact on the overutilized state
> >>> due to an overestimation of the resulting utilization of tasks.
> >>>
> >>> The score is impacted by the patch, but:
> >>> - it is expected to have slightly lower scores with EAS running more
> >>>   often
> >>> - the base version making the system run at higher frequencies by
> >>>   overestimating task utilization, it is expected to have higher scores
> >>
> >> I'm not sure to get what you are trying to solve here ?
>
> Yeah, the question is how much perf loss we accept for energy savings?
> IMHO, impossible to answer generically based on one specific
> workload/platform incarnation.

I think that your problem is somewhere else like the fact the /Sum of
util_est is always higher (and sometime far higher) than the actual
max util_avg because we sum the max of all tasks and as you can see in
my example above, the runnable but not running slice decrease the
util_avg of the task.

>
> [...]
>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 3e9ca38512de..d058ab29e52e 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> >>>          * To avoid underestimate of task utilization, skip updates of EWMA if
> >>>          * we cannot grant that thread got all CPU time it wanted.
> >>>          */
> >>> -       if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> >>> +       if (rq_no_idle_pelt(rq_of(cfs_rq)))
> >>
> >> You can't use here the test that is done in
> >> update_idle_rq_clock_pelt() to detect if we lost some idle time
> >> because this test is only relevant when the rq becomes idle which is
> >> not the case here
>
> Do you mean this test ?
>
> util_avg = util_sum / divider
>
> util_sum >= divider * util_avg
>
> with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
> bound of the window (+ 1024):
>
> util_sum >= (LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT - LOAD_AVG_MAX
>
> Why can't we use it here?

because of the example below, it makes the filtering a nop for a very
large time and you will be overutilized far before

>
> >> With this test you skip completely the cases where the task has to
> >> share the CPU with others. As an example on the pixel 6, the little
>
> True. But I assume that's anticipated here. The assumption is that as
> long as there is idle time, tasks get what they want in a time frame.
>
> >> cpus must run more than 1.2 seconds at its max freq before detecting
> >> that there is no idle time
>
> BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
> 2.07s (with CPU capacity of Pix5 little CPU = 160)?

yeah, I use the wrong rb5 little capacity instead of pixel6 but that even worse

>
> [...]