linux-kernel - Re: [PATCH] sched/fair: Decrease util

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtDGP3299YNh9hgcWj8WrhhDT841KX+w5JWxoKEnqM6h+Q@mail.gmail.com>
Date: Fri, 20 Dec 2024 08:47:47 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Pierre Gondois <pierre.gondois@....com>
Cc: linux-kernel@...r.kernel.org, Chritian Loehle <christian.loehle@....com>, 
	Hongyan Xia <hongyan.xia2@....com>, Ingo Molnar <mingo@...hat.com>, 
	Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH] sched/fair: Decrease util_est in presence of idle time

On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
<vincent.guittot@...aro.org> wrote:
>
> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@....com> wrote:
> >
> > util_est signal does not decay if the task utilization is lower
> > than its runnable signal by a value of 10. This was done to keep
>
> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
> worth updating util_est
>
> > the util_est signal high in case a task shares a rq with another
> > task and doesn't obtain a desired running time.
> >
> > However, tasks sharing a rq obtain the running time they desire
> > provided that the rq has some idle time. Indeed, either:
> > - a CPU is always running. The utilization signal of tasks reflects
> >   the running time they obtained. This running time depends on the
> >   niceness of the tasks. A decreasing utilization signal doesn't
> >   reflect a decrease of the task activity and the util_est signal
> >   should not be decayed in this case.
> > - a CPU is not always running (i.e. there is some idle time). Tasks
> >   might be waiting to run, increasing their runnable signal, but
> >   eventually run to completion. A decreasing utilization signal
> >   does reflect a decrease of the task activity and the util_est
> >   signal should be decayed in this case.
>
> This is not always true
> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
> compute capacity. its util_avg is up to 674 at dequeue as well as its
> util_est
> Then start a 2nd task with the exact same behavior on the same cpu.
> The util_avg of this 2nd task will be only 496 at dequeue as well as
> its util_est but there is still 20ms of idle time. Furthermore,  The
> util_avg of the 1st task is also around 496 at dequeue but

the end of the sentence was missing...

but there is still 20ms of idle time.

>
> >
> > ------------
> >
> > Running on a 1024 capacity CPU:
> > - TaskA:
> >   - duty_cycle=60%, period=16ms, duration=2s
> > - TaskB:
> >   - sleep=2ms (to let TaskA run first), duration=1s
> >   - first phase: duty_cycle=20%, period=16ms, duration=1s
> >   - second phase: duty_cycle=5%, period=16ms, duration=1s
> >
> > When TaskB starts the second phase, the util_est signal can take up to
> > ~150ms before starting to decrease. Indeed, if TaskB runs after
> > TaskA, its runnable signal will be higher than its util signal by more
> > than 10 units.
> > This creates unnecessary frequency spikes: upon enqueuing TaskB,
> > util_est_cfs is increased by the old value of util_est of TaskB,
> > impacting frequency selection through:
> > sugov_get_util()
> > \-cpu_util_cfs_boost()
> >   \-cpu_util()
> >                 util_est = READ_ONCE(cfs_rq->avg.util_est);
> >
> > ------------
> >
> > Energy impact can also be measured as suggested by Hongyan at [2].
> > On a Pixel6, the following workload is run 10 times:
> > - TaskA:
> >   - duty_cycle=20%, duration=0.4s
> >   - one task per mid and big CPU (2 mid and 2 big CPUs, so 4 in total)
> >   - Used to increase the runnable signals of TaskB
> > - TaskB:
> >   - sleep=2ms (to let TaskA run first)
> >   - first phase: duty_cycle=20%, duration=0.2s
> >   - second phase: duty_cycle=5%, duration=0.2s
> >   - 4 occurrences of TaskB
> >
> > The duration of the workload is low (0.4s) to emphasis the impact of
> > continuing to run at an overestimated frequency.
> >
> > Energy measured with energy counters:
> > ┌────────────┬────────────┬───────────┬───────────┬───────────┐
> > │ base mean  ┆ patch mean ┆ base std  ┆ patch std ┆ ratio (%) │
> > ╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
> > │ 536.412419 ┆ 487.232935 ┆ 68.591493 ┆ 66.862019 ┆     -9.16 │
> > └────────────┴────────────┴───────────┴───────────┴───────────┘
> >
> > Energy computed from util signals and energy model:
> > ┌────────────┬────────────┬───────────┬───────────┬───────────┐
> > │ base mean  ┆ patch mean ┆ base std  ┆ patch std ┆ ratio (%) │
> > ╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
> > │  4.8318e9  ┆   4.0823e9 ┆ 5.1044e8  ┆  7.5558e8 ┆    -15.51 │
> > └────────────┴────────────┴───────────┴───────────┴───────────┘
> >
> > ------------
> >
> > The initial patch [2] aimed to solve an issue detected while running
> > speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
> > versions are compared:
> > - base: the current version
>
> What do you mean by current version ? tip/sched/core ?
>
> > - patch: the new version, with this patch applied
> > - revert: the initial version, with commit [2] reverted
> >
> > Score (higher is better):
> > ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> > │ base mean  ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> > ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> > │     108.16 ┆     104.06 ┆     105.82 ┆      -3.94% ┆       -2.16% │
> > └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> > ┌───────────┬───────────┬────────────┐
> > │ base std  ┆ patch std ┆ revert std │
> > ╞═══════════╪═══════════╪════════════╡
> > │      0.57 ┆      0.49 ┆       0.58 │
> > └───────────┴───────────┴────────────┘
> >
> > Energy measured with energy counters:
> > ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> > │ base mean  ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> > ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> > │  141262.79 ┆  130630.09 ┆  134108.07 ┆      -7.52% ┆       -5.64% │
> > └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> > ┌───────────┬───────────┬────────────┐
> > │ base std  ┆ patch std ┆ revert std │
> > ╞═══════════╪═══════════╪════════════╡
> > │   1347.13 ┆   2431.67 ┆     510.88 │
> > └───────────┴───────────┴────────────┘
> >
> > Energy computed from util signals and energy model:
> > ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> > │ base mean  ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> > ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> > │  2.0539e12 ┆  1.3569e12 ┆ 1.3637e+12 ┆     -33.93% ┆      -33.60% │
> > └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> > ┌───────────┬───────────┬────────────┐
> > │ base std  ┆ patch std ┆ revert std │
> > ╞═══════════╪═══════════╪════════════╡
> > │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
> > └───────────┴───────────┴────────────┘
> >
> > OU ratio in % (ratio of time being overutilized over total time).
> > The test lasts ~65s:
> > ┌────────────┬────────────┬─────────────┐
> > │ base mean  ┆ patch mean ┆ revert mean │
> > ╞════════════╪════════════╪═════════════╡
> > │     63.39% ┆     12.48% ┆      12.28% │
> > └────────────┴────────────┴─────────────┘
> > ┌───────────┬───────────┬─────────────┐
> > │ base std  ┆ patch std ┆ revert mean │
> > ╞═══════════╪═══════════╪═════════════╡
> > │      0.97 ┆      0.28 ┆        0.88 │
> > └───────────┴───────────┴─────────────┘
> >
> > The energy gain can be explained by the fact that the system is
> > overutilized during most of the test with the base version.
> >
> > During the test, the base condition is evaluated to true ~40%
> > of the time. The new condition is evaluated to true ~2% of
> > the time. Preventing util_est signals to decay with the base
> > condition has a significant impact on the overutilized state
> > due to an overestimation of the resulting utilization of tasks.
> >
> > The score is impacted by the patch, but:
> > - it is expected to have slightly lower scores with EAS running more
> >   often
> > - the base version making the system run at higher frequencies by
> >   overestimating task utilization, it is expected to have higher scores
>
> I'm not sure to get what you are trying to solve here ?
>
> >
> > ------------
> >
> > Decrease util_est when the rq has some idle time instead of a delta
> > between util and runnable signals.
> >
> > [1] https://lore.kernel.org/lkml/39cde23a-19d8-4e64-a1d2-f26bce264883@arm.com/
> > [2] commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task utilization")
> > https://lore.kernel.org/lkml/20231122140119.472110-1-vincent.guittot@linaro.org/
> > [3] https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@mail.gmail.com/#t
> >
> > Signed-off-by: Pierre Gondois <pierre.gondois@....com>
> > ---
> >  kernel/sched/fair.c |  2 +-
> >  kernel/sched/pelt.h | 23 ++++++++++++++++-------
> >  2 files changed, 17 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3e9ca38512de..d058ab29e52e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> >          * To avoid underestimate of task utilization, skip updates of EWMA if
> >          * we cannot grant that thread got all CPU time it wanted.
> >          */
> > -       if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> > +       if (rq_no_idle_pelt(rq_of(cfs_rq)))
>
> You can't use here the test that is done in
> update_idle_rq_clock_pelt() to detect if we lost some idle time
> because this test is only relevant when the rq becomes idle which is
> not the case here
>
> With this test you skip completely the cases where the task has to
> share the CPU with others. As an example on the pixel 6, the little
> cpus must run more than 1.2 seconds at its max freq before detecting
> that there is no idle time
>
> >                 goto done;
> >
> >
> > diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> > index f4f6a0875c66..eaf34f95fd93 100644
> > --- a/kernel/sched/pelt.h
> > +++ b/kernel/sched/pelt.h
> > @@ -123,21 +123,30 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
> >  }
> >
> >  /*
> > - * When rq becomes idle, we have to check if it has lost idle time
> > - * because it was fully busy. A rq is fully used when the /Sum util_sum
> > - * is greater or equal to:
> > + * A rq is fully used (or is considered to not have enough idle time)
> > + * when the /Sum util_sum is greater or equal to:
> >   * (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> >   * For optimization and computing rounding purpose, we don't take into account
> >   * the position in the current window (period_contrib) and we use the higher
> >   * bound of util_sum to decide.
> >   */
> > -static inline void update_idle_rq_clock_pelt(struct rq *rq)
> > +static inline bool rq_no_idle_pelt(struct rq *rq)
> >  {
> > -       u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
> > -       u32 util_sum = rq->cfs.avg.util_sum;
> > +       u32 util_sum, divider;
> > +
> > +       divider = (PELT_MIN_DIVIDER << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
> > +       util_sum = rq->cfs.avg.util_sum;
> >         util_sum += rq->avg_rt.util_sum;
> >         util_sum += rq->avg_dl.util_sum;
> > +       return util_sum >= divider;
> > +}
> >
> > +/*
> > + * When rq becomes idle, we have to check if it has lost idle time
> > + * because it was fully busy.
> > + */
> > +static inline void update_idle_rq_clock_pelt(struct rq *rq)
> > +{
> >         /*
> >          * Reflecting stolen time makes sense only if the idle
> >          * phase would be present at max capacity. As soon as the
> > @@ -147,7 +156,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> >          * this case. We keep track of this lost idle time compare to
> >          * rq's clock_task.
> >          */
> > -       if (util_sum >= divider)
> > +       if (rq_no_idle_pelt(rq))
> >                 rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> >
> >         _update_idle_rq_clock_pelt(rq);
> > --
> > 2.25.1
> >