linux-kernel - Re: [PATCH] sched/fair: update scale invariance of PELT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAKfTPtCK9T=1qRxq5YweLntbaRuvzMATQ=vzJ5t-Kb53TvUu+Q@mail.gmail.com>
Date:   Wed, 29 Mar 2017 10:05:00 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>
Cc:     Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Yuyang Du <yuyang.du@...el.com>, Paul Turner <pjt@...gle.com>,
        Ben Segall <bsegall@...gle.com>,
        Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [PATCH] sched/fair: update scale invariance of PELT

On 28 March 2017 at 17:35, Vincent Guittot <vincent.guittot@...aro.org> wrote:
> The current implementation of load tracking invariance scales the contribution
> with current frequency and uarch performance (only for utilization) of the
> CPU. One main result of this formula is that the figures are capped by current
> capacity of CPU. Another one is that the load_avg is not invariant because not
> scaled with uarch.
>
> The util_avg of a periodic task that runs r time slots every p time slots
> varies in the range :
>
>     U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
>
> with U is the max util_avg value = SCHED_CAPACITY_SCALE
>
> At a lower capacity, the range becomes:
>
>     U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)
>
> with C reflecting the compute capacity ratio between current capacity and
> max capacity.
>
> so C tries to compensate changes in (1-y^r') but it can't be accurate.
>
> Instead of scaling the contribution value of PELT algo, we should scale the
> running time. The PELT signal aims to track the amount of computation of tasks
> and/or rq so it seems more correct to scale the running time to reflect the
> effective amount of computation done since the last update.
>
> In order to be fully invariant, we need to apply the same amount of running
> time and idle time whatever the current capacity. Because running at lower
> capacity implies that the task will run longer, we have to track the amount of
> "stolen" idle time and to apply it when task becomes idle.
>
> But once we have reached the maximum utilization value (SCHED_CAPACITY_SCALE),
> it means that the task is seen as an always-running task whatever the capacity of
> the cpu (even at max compute capacity). In this case, we can discard the "stolen"
> idle times which becomes meaningless. In order to cope with rounding effect of
> PELT algo we take a margin and consider task with utilization greater than 1000
> (vs 1024 max) as an always-running task.
>
> Then, we can use the same algorithm for both utilization and load and
> simplify __update_load_avg now that the load of a task doesn't have to be
> capped by CPU uarch.
>
> The responsivness of PELT is improved when CPU is not running at max
> capacity with this new algorithm. I have put below some examples of
> duration to reach some typical load values according to the capacity of the
> CPU with current implementation and with this patch.
>
> Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
> 972 (95%)    138ms         not reachable            276ms
> 486 (47.5%)  30ms          138ms                     60ms
> 256 (25%)    13ms           32ms                     26ms
>
> On my hikey (octo ARM platform) with schedutil governor, the time to reach
> max OPP when starting from a null utilization, decreases from 223ms with
> current scale invariance down to 121ms with the new algorithm. For this
> test, i have enable arch_scale_freq for arm64.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> ---

I just noticed that i forgot to mentioned that this patch was a v2 and
the discussion
about v1 can be found here:
https://lkml.org/lkml/2015/11/24/432

Change since v1:
* track stolen idle time to make PELT fully invariant

Vincent

>  include/linux/sched.h |  1 +
>  kernel/sched/fair.c   | 49 ++++++++++++++++++++++++++++++++++---------------
>  2 files changed, 35 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d67eee8..ca9d00f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -313,6 +313,7 @@ struct load_weight {
>   */
>  struct sched_avg {
>         u64                             last_update_time;
> +       u64                             stolen_idle_time;
>         u64                             load_sum;
>         u32                             util_sum;
>         u32                             period_contrib;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 31453d5..d1514cb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -735,6 +735,7 @@ void init_entity_runnable_average(struct sched_entity *se)
>         struct sched_avg *sa = &se->avg;
>
>         sa->last_update_time = 0;
> +       sa->stolen_idle_time = 0;
>         /*
>          * sched_avg's period_contrib should be strictly less then 1024, so
>          * we give it 1023 to make sure it is almost a period (1024us), and
> @@ -2852,10 +2853,9 @@ static __always_inline int
>  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                   unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
> -       u64 delta, scaled_delta, periods;
> +       u64 delta, periods;
>         u32 contrib;
> -       unsigned int delta_w, scaled_delta_w, decayed = 0;
> -       unsigned long scale_freq, scale_cpu;
> +       unsigned int delta_w, decayed = 0;
>
>         delta = now - sa->last_update_time;
>         /*
> @@ -2876,8 +2876,30 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                 return 0;
>         sa->last_update_time = now;
>
> -       scale_freq = arch_scale_freq_capacity(NULL, cpu);
> -       scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +       if (running) {
> +               sa->stolen_idle_time += delta;
> +               /*
> +                * scale the elapsed time to reflect the real amount of
> +                * computation
> +                */
> +               delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> +               delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> +
> +               /*
> +                * Track the amount of stolen idle time due to running at
> +                * lower capacity
> +                */
> +               sa->stolen_idle_time -= delta;
> +       } else if (!weight) {
> +               if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
> +                       /*
> +                        * Add the idle time stolen by running at lower compute
> +                        * capacity
> +                        */
> +                       delta += sa->stolen_idle_time;
> +               }
> +               sa->stolen_idle_time = 0;
> +       }
>
>         /* delta_w is the amount already accumulated against our next period */
>         delta_w = sa->period_contrib;
> @@ -2893,16 +2915,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                  * period and accrue it.
>                  */
>                 delta_w = 1024 - delta_w;
> -               scaled_delta_w = cap_scale(delta_w, scale_freq);
>                 if (weight) {
> -                       sa->load_sum += weight * scaled_delta_w;
> +                       sa->load_sum += weight * delta_w;
>                         if (cfs_rq) {
>                                 cfs_rq->runnable_load_sum +=
> -                                               weight * scaled_delta_w;
> +                                               weight * delta_w;
>                         }
>                 }
>                 if (running)
> -                       sa->util_sum += scaled_delta_w * scale_cpu;
> +                       sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
>
>                 delta -= delta_w;
>
> @@ -2919,25 +2940,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>
>                 /* Efficiently calculate \sum (1..n_period) 1024*y^i */
>                 contrib = __compute_runnable_contrib(periods);
> -               contrib = cap_scale(contrib, scale_freq);
>                 if (weight) {
>                         sa->load_sum += weight * contrib;
>                         if (cfs_rq)
>                                 cfs_rq->runnable_load_sum += weight * contrib;
>                 }
>                 if (running)
> -                       sa->util_sum += contrib * scale_cpu;
> +                       sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
>         }
>
>         /* Remainder of delta accrued against u_0` */
> -       scaled_delta = cap_scale(delta, scale_freq);
>         if (weight) {
> -               sa->load_sum += weight * scaled_delta;
> +               sa->load_sum += weight * delta;
>                 if (cfs_rq)
> -                       cfs_rq->runnable_load_sum += weight * scaled_delta;
> +                       cfs_rq->runnable_load_sum += weight * delta;
>         }
>         if (running)
> -               sa->util_sum += scaled_delta * scale_cpu;
> +               sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
>
>         sa->period_contrib += delta;
>
> --
> 2.7.4
>