[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtDGP3299YNh9hgcWj8WrhhDT841KX+w5JWxoKEnqM6h+Q@mail.gmail.com>
Date: Fri, 20 Dec 2024 08:47:47 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Pierre Gondois <pierre.gondois@....com>
Cc: linux-kernel@...r.kernel.org, Chritian Loehle <christian.loehle@....com>,
Hongyan Xia <hongyan.xia2@....com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] sched/fair: Decrease util_est in presence of idle time
On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
<vincent.guittot@...aro.org> wrote:
>
> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@....com> wrote:
> >
> > util_est signal does not decay if the task utilization is lower
> > than its runnable signal by a value of 10. This was done to keep
>
> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
> worth updating util_est
>
> > the util_est signal high in case a task shares a rq with another
> > task and doesn't obtain a desired running time.
> >
> > However, tasks sharing a rq obtain the running time they desire
> > provided that the rq has some idle time. Indeed, either:
> > - a CPU is always running. The utilization signal of tasks reflects
> > the running time they obtained. This running time depends on the
> > niceness of the tasks. A decreasing utilization signal doesn't
> > reflect a decrease of the task activity and the util_est signal
> > should not be decayed in this case.
> > - a CPU is not always running (i.e. there is some idle time). Tasks
> > might be waiting to run, increasing their runnable signal, but
> > eventually run to completion. A decreasing utilization signal
> > does reflect a decrease of the task activity and the util_est
> > signal should be decayed in this case.
>
> This is not always true
> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
> compute capacity. its util_avg is up to 674 at dequeue as well as its
> util_est
> Then start a 2nd task with the exact same behavior on the same cpu.
> The util_avg of this 2nd task will be only 496 at dequeue as well as
> its util_est but there is still 20ms of idle time. Furthermore, The
> util_avg of the 1st task is also around 496 at dequeue but
the end of the sentence was missing...
but there is still 20ms of idle time.
>
> >
> > ------------
> >
> > Running on a 1024 capacity CPU:
> > - TaskA:
> > - duty_cycle=60%, period=16ms, duration=2s
> > - TaskB:
> > - sleep=2ms (to let TaskA run first), duration=1s
> > - first phase: duty_cycle=20%, period=16ms, duration=1s
> > - second phase: duty_cycle=5%, period=16ms, duration=1s
> >
> > When TaskB starts the second phase, the util_est signal can take up to
> > ~150ms before starting to decrease. Indeed, if TaskB runs after
> > TaskA, its runnable signal will be higher than its util signal by more
> > than 10 units.
> > This creates unnecessary frequency spikes: upon enqueuing TaskB,
> > util_est_cfs is increased by the old value of util_est of TaskB,
> > impacting frequency selection through:
> > sugov_get_util()
> > \-cpu_util_cfs_boost()
> > \-cpu_util()
> > util_est = READ_ONCE(cfs_rq->avg.util_est);
> >
> > ------------
> >
> > Energy impact can also be measured as suggested by Hongyan at [2].
> > On a Pixel6, the following workload is run 10 times:
> > - TaskA:
> > - duty_cycle=20%, duration=0.4s
> > - one task per mid and big CPU (2 mid and 2 big CPUs, so 4 in total)
> > - Used to increase the runnable signals of TaskB
> > - TaskB:
> > - sleep=2ms (to let TaskA run first)
> > - first phase: duty_cycle=20%, duration=0.2s
> > - second phase: duty_cycle=5%, duration=0.2s
> > - 4 occurrences of TaskB
> >
> > The duration of the workload is low (0.4s) to emphasis the impact of
> > continuing to run at an overestimated frequency.
> >
> > Energy measured with energy counters:
> > ┌────────────┬────────────┬───────────┬───────────┬───────────┐
> > │ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
> > ╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
> > │ 536.412419 ┆ 487.232935 ┆ 68.591493 ┆ 66.862019 ┆ -9.16 │
> > └────────────┴────────────┴───────────┴───────────┴───────────┘
> >
> > Energy computed from util signals and energy model:
> > ┌────────────┬────────────┬───────────┬───────────┬───────────┐
> > │ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
> > ╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
> > │ 4.8318e9 ┆ 4.0823e9 ┆ 5.1044e8 ┆ 7.5558e8 ┆ -15.51 │
> > └────────────┴────────────┴───────────┴───────────┴───────────┘
> >
> > ------------
> >
> > The initial patch [2] aimed to solve an issue detected while running
> > speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
> > versions are compared:
> > - base: the current version
>
> What do you mean by current version ? tip/sched/core ?
>
> > - patch: the new version, with this patch applied
> > - revert: the initial version, with commit [2] reverted
> >
> > Score (higher is better):
> > ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> > │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> > ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> > │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
> > └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> > ┌───────────┬───────────┬────────────┐
> > │ base std ┆ patch std ┆ revert std │
> > ╞═══════════╪═══════════╪════════════╡
> > │ 0.57 ┆ 0.49 ┆ 0.58 │
> > └───────────┴───────────┴────────────┘
> >
> > Energy measured with energy counters:
> > ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> > │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> > ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> > │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
> > └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> > ┌───────────┬───────────┬────────────┐
> > │ base std ┆ patch std ┆ revert std │
> > ╞═══════════╪═══════════╪════════════╡
> > │ 1347.13 ┆ 2431.67 ┆ 510.88 │
> > └───────────┴───────────┴────────────┘
> >
> > Energy computed from util signals and energy model:
> > ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> > │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> > ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> > │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
> > └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> > ┌───────────┬───────────┬────────────┐
> > │ base std ┆ patch std ┆ revert std │
> > ╞═══════════╪═══════════╪════════════╡
> > │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
> > └───────────┴───────────┴────────────┘
> >
> > OU ratio in % (ratio of time being overutilized over total time).
> > The test lasts ~65s:
> > ┌────────────┬────────────┬─────────────┐
> > │ base mean ┆ patch mean ┆ revert mean │
> > ╞════════════╪════════════╪═════════════╡
> > │ 63.39% ┆ 12.48% ┆ 12.28% │
> > └────────────┴────────────┴─────────────┘
> > ┌───────────┬───────────┬─────────────┐
> > │ base std ┆ patch std ┆ revert mean │
> > ╞═══════════╪═══════════╪═════════════╡
> > │ 0.97 ┆ 0.28 ┆ 0.88 │
> > └───────────┴───────────┴─────────────┘
> >
> > The energy gain can be explained by the fact that the system is
> > overutilized during most of the test with the base version.
> >
> > During the test, the base condition is evaluated to true ~40%
> > of the time. The new condition is evaluated to true ~2% of
> > the time. Preventing util_est signals to decay with the base
> > condition has a significant impact on the overutilized state
> > due to an overestimation of the resulting utilization of tasks.
> >
> > The score is impacted by the patch, but:
> > - it is expected to have slightly lower scores with EAS running more
> > often
> > - the base version making the system run at higher frequencies by
> > overestimating task utilization, it is expected to have higher scores
>
> I'm not sure to get what you are trying to solve here ?
>
> >
> > ------------
> >
> > Decrease util_est when the rq has some idle time instead of a delta
> > between util and runnable signals.
> >
> > [1] https://lore.kernel.org/lkml/39cde23a-19d8-4e64-a1d2-f26bce264883@arm.com/
> > [2] commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task utilization")
> > https://lore.kernel.org/lkml/20231122140119.472110-1-vincent.guittot@linaro.org/
> > [3] https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@mail.gmail.com/#t
> >
> > Signed-off-by: Pierre Gondois <pierre.gondois@....com>
> > ---
> > kernel/sched/fair.c | 2 +-
> > kernel/sched/pelt.h | 23 ++++++++++++++++-------
> > 2 files changed, 17 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3e9ca38512de..d058ab29e52e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> > * To avoid underestimate of task utilization, skip updates of EWMA if
> > * we cannot grant that thread got all CPU time it wanted.
> > */
> > - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> > + if (rq_no_idle_pelt(rq_of(cfs_rq)))
>
> You can't use here the test that is done in
> update_idle_rq_clock_pelt() to detect if we lost some idle time
> because this test is only relevant when the rq becomes idle which is
> not the case here
>
> With this test you skip completely the cases where the task has to
> share the CPU with others. As an example on the pixel 6, the little
> cpus must run more than 1.2 seconds at its max freq before detecting
> that there is no idle time
>
> > goto done;
> >
> >
> > diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> > index f4f6a0875c66..eaf34f95fd93 100644
> > --- a/kernel/sched/pelt.h
> > +++ b/kernel/sched/pelt.h
> > @@ -123,21 +123,30 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
> > }
> >
> > /*
> > - * When rq becomes idle, we have to check if it has lost idle time
> > - * because it was fully busy. A rq is fully used when the /Sum util_sum
> > - * is greater or equal to:
> > + * A rq is fully used (or is considered to not have enough idle time)
> > + * when the /Sum util_sum is greater or equal to:
> > * (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> > * For optimization and computing rounding purpose, we don't take into account
> > * the position in the current window (period_contrib) and we use the higher
> > * bound of util_sum to decide.
> > */
> > -static inline void update_idle_rq_clock_pelt(struct rq *rq)
> > +static inline bool rq_no_idle_pelt(struct rq *rq)
> > {
> > - u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
> > - u32 util_sum = rq->cfs.avg.util_sum;
> > + u32 util_sum, divider;
> > +
> > + divider = (PELT_MIN_DIVIDER << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
> > + util_sum = rq->cfs.avg.util_sum;
> > util_sum += rq->avg_rt.util_sum;
> > util_sum += rq->avg_dl.util_sum;
> > + return util_sum >= divider;
> > +}
> >
> > +/*
> > + * When rq becomes idle, we have to check if it has lost idle time
> > + * because it was fully busy.
> > + */
> > +static inline void update_idle_rq_clock_pelt(struct rq *rq)
> > +{
> > /*
> > * Reflecting stolen time makes sense only if the idle
> > * phase would be present at max capacity. As soon as the
> > @@ -147,7 +156,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> > * this case. We keep track of this lost idle time compare to
> > * rq's clock_task.
> > */
> > - if (util_sum >= divider)
> > + if (rq_no_idle_pelt(rq))
> > rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> >
> > _update_idle_rq_clock_pelt(rq);
> > --
> > 2.25.1
> >
Powered by blists - more mailing lists