[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtCzz9qTRWkDqQzHyguy=pqnasXVy5smd=PeOqYTdtx1FQ@mail.gmail.com>
Date: Fri, 20 Dec 2024 16:05:29 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Pierre Gondois <pierre.gondois@....com>, linux-kernel@...r.kernel.org,
Chritian Loehle <christian.loehle@....com>, Hongyan Xia <hongyan.xia2@....com>,
Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] sched/fair: Decrease util_est in presence of idle time
On Fri, 20 Dec 2024 at 15:48, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>
> On 20/12/2024 08:47, Vincent Guittot wrote:
> > On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
> > <vincent.guittot@...aro.org> wrote:
> >>
> >> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@....com> wrote:
> >>>
> >>> util_est signal does not decay if the task utilization is lower
> >>> than its runnable signal by a value of 10. This was done to keep
> >>
> >> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
> >> worth updating util_est
> Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
> the mechanism is too sensitive?
The default config is to follow util_est update
>
> It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
> 446 446 446) in cases 2 tasks are scheduled on the same little CPU:
>
> ...
> task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
> task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
> task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
> task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
> ...
>
> I'm not sure if the original case (Speedometer on Pix6 ?) which lead to
> this implementation was tested with perf/energy numbers back then?
>
> >>> the util_est signal high in case a task shares a rq with another
> >>> task and doesn't obtain a desired running time.
> >>>
> >>> However, tasks sharing a rq obtain the running time they desire
> >>> provided that the rq has some idle time. Indeed, either:
> >>> - a CPU is always running. The utilization signal of tasks reflects
> >>> the running time they obtained. This running time depends on the
> >>> niceness of the tasks. A decreasing utilization signal doesn't
> >>> reflect a decrease of the task activity and the util_est signal
> >>> should not be decayed in this case.
> >>> - a CPU is not always running (i.e. there is some idle time). Tasks
> >>> might be waiting to run, increasing their runnable signal, but
> >>> eventually run to completion. A decreasing utilization signal
> >>> does reflect a decrease of the task activity and the util_est
> >>> signal should be decayed in this case.
> >>
> >> This is not always true
> >> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
> >> compute capacity. its util_avg is up to 674 at dequeue as well as its
> >> util_est
> >> Then start a 2nd task with the exact same behavior on the same cpu.
> >> The util_avg of this 2nd task will be only 496 at dequeue as well as
> >> its util_est but there is still 20ms of idle time. Furthermore, The
> >> util_avg of the 1st task is also around 496 at dequeue but
> >
> > the end of the sentence was missing...
> >
> > but there is still 20ms of idle time.
>
> But these two tasks are still able to finish there activity within this
> 100ms window. So why should we keep their util_est values high when
> dequeuing?
But then, the util_est decreases from the original value compared to
alone whereas its utilization is the same
>
> [...]
>
> >>> The initial patch [2] aimed to solve an issue detected while running
> >>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
> >>> versions are compared:
> >>> - base: the current version
> >>
> >> What do you mean by current version ? tip/sched/core ?
> >>
> >>> - patch: the new version, with this patch applied
> >>> - revert: the initial version, with commit [2] reverted
> >>>
> >>> Score (higher is better):
> >>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>> │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
> >>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>> ┌───────────┬───────────┬────────────┐
> >>> │ base std ┆ patch std ┆ revert std │
> >>> ╞═══════════╪═══════════╪════════════╡
> >>> │ 0.57 ┆ 0.49 ┆ 0.58 │
> >>> └───────────┴───────────┴────────────┘
> >>>
> >>> Energy measured with energy counters:
> >>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>> │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
> >>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>> ┌───────────┬───────────┬────────────┐
> >>> │ base std ┆ patch std ┆ revert std │
> >>> ╞═══════════╪═══════════╪════════════╡
> >>> │ 1347.13 ┆ 2431.67 ┆ 510.88 │
> >>> └───────────┴───────────┴────────────┘
> >>>
> >>> Energy computed from util signals and energy model:
> >>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>> │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
> >>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>> ┌───────────┬───────────┬────────────┐
> >>> │ base std ┆ patch std ┆ revert std │
> >>> ╞═══════════╪═══════════╪════════════╡
> >>> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
> >>> └───────────┴───────────┴────────────┘
> >>>
> >>> OU ratio in % (ratio of time being overutilized over total time).
> >>> The test lasts ~65s:
> >>> ┌────────────┬────────────┬─────────────┐
> >>> │ base mean ┆ patch mean ┆ revert mean │
> >>> ╞════════════╪════════════╪═════════════╡
> >>> │ 63.39% ┆ 12.48% ┆ 12.28% │
> >>> └────────────┴────────────┴─────────────┘
> >>> ┌───────────┬───────────┬─────────────┐
> >>> │ base std ┆ patch std ┆ revert mean │
> >>> ╞═══════════╪═══════════╪═════════════╡
> >>> │ 0.97 ┆ 0.28 ┆ 0.88 │
> >>> └───────────┴───────────┴─────────────┘
> >>>
> >>> The energy gain can be explained by the fact that the system is
> >>> overutilized during most of the test with the base version.
> >>>
> >>> During the test, the base condition is evaluated to true ~40%
> >>> of the time. The new condition is evaluated to true ~2% of
> >>> the time. Preventing util_est signals to decay with the base
> >>> condition has a significant impact on the overutilized state
> >>> due to an overestimation of the resulting utilization of tasks.
> >>>
> >>> The score is impacted by the patch, but:
> >>> - it is expected to have slightly lower scores with EAS running more
> >>> often
> >>> - the base version making the system run at higher frequencies by
> >>> overestimating task utilization, it is expected to have higher scores
> >>
> >> I'm not sure to get what you are trying to solve here ?
>
> Yeah, the question is how much perf loss we accept for energy savings?
> IMHO, impossible to answer generically based on one specific
> workload/platform incarnation.
I think that your problem is somewhere else like the fact the /Sum of
util_est is always higher (and sometime far higher) than the actual
max util_avg because we sum the max of all tasks and as you can see in
my example above, the runnable but not running slice decrease the
util_avg of the task.
>
> [...]
>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 3e9ca38512de..d058ab29e52e 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> >>> * To avoid underestimate of task utilization, skip updates of EWMA if
> >>> * we cannot grant that thread got all CPU time it wanted.
> >>> */
> >>> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> >>> + if (rq_no_idle_pelt(rq_of(cfs_rq)))
> >>
> >> You can't use here the test that is done in
> >> update_idle_rq_clock_pelt() to detect if we lost some idle time
> >> because this test is only relevant when the rq becomes idle which is
> >> not the case here
>
> Do you mean this test ?
>
> util_avg = util_sum / divider
>
> util_sum >= divider * util_avg
>
> with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
> bound of the window (+ 1024):
>
> util_sum >= (LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT - LOAD_AVG_MAX
>
> Why can't we use it here?
because of the example below, it makes the filtering a nop for a very
large time and you will be overutilized far before
>
> >> With this test you skip completely the cases where the task has to
> >> share the CPU with others. As an example on the pixel 6, the little
>
> True. But I assume that's anticipated here. The assumption is that as
> long as there is idle time, tasks get what they want in a time frame.
>
> >> cpus must run more than 1.2 seconds at its max freq before detecting
> >> that there is no idle time
>
> BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
> 2.07s (with CPU capacity of Pix5 little CPU = 160)?
yeah, I use the wrong rb5 little capacity instead of pixel6 but that even worse
>
> [...]
Powered by blists - more mailing lists