Message-ID: <CAKfTPtD+HV9xDkReXerjrdVpv6O1Cw8JY8Of9WEUe0vh09qdUQ@mail.gmail.com>
Date:   Tue, 31 Oct 2017 12:14:11 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Tejun Heo <tj@...nel.org>, Josef Bacik <josef@...icpanda.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Mike Galbraith <efault@....de>, Paul Turner <pjt@...gle.com>,
        Chris Mason <clm@...com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Ben Segall <bsegall@...gle.com>,
        Yuyang Du <yuyang.du@...el.com>
Subject: Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

On 30 October 2017 at 18:20, Peter Zijlstra <peterz@...radead.org> wrote:
> So after a bit of poking I ended up with something like the below; I
> think there are still a few open points, see XXX. But it's better than
> what we have now.
>
> Josef, could you see if this completely wrecks your workloads?
>
> ---
> Subject: sched: Update runnable propagation rule
> From: Vincent Guittot <vincent.guittot@...aro.org>
> Date: Thu, 19 Oct 2017 17:04:42 +0200
>
> Unlike running, the runnable part can't be directly propagated through
> the hierarchy when we migrate a task. The main reason is that runnable
> time can be shared with other sched_entities that stay on the rq, and
> this runnable time will also remain on the prev cfs_rq and must not be
> removed.
>
> Instead, we can estimate the new runnable of the prev cfs_rq and check
> that this estimate stays within a plausible range. The
> prop_runnable_sum is a good estimate when adding runnable_sum, but
> fails most often when we remove it. Instead, we can use the formula
> below:
>
>   gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight
>
> which assumes that tasks are equally runnable, which is not true but
> is easy to compute.
>
> Besides these estimates, we have several simple rules that help us
> filter out the wrong ones:
>
>  - ge->avg.runnable_sum <= LOAD_AVG_MAX
>  - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
>  - ge->avg.runnable_sum can't increase when we detach a task
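
As a toy userspace illustration (all numbers made up; only LOAD_AVG_MAX
matches the kernel constant), here is the estimate with the three
filter rules above applied in order:

#include <stdio.h>

#define LOAD_AVG_MAX 47742	/* maximum attainable PELT sum */

int main(void)
{
	/* Hypothetical gcfs_rq state after a task left: two remaining
	 * tasks of weight 1024, each runnable 30000/47742 of the time. */
	long grq_load_weight = 2 * 1024;
	long grq_load_sum = grq_load_weight * 30000L;

	long ge_runnable_sum = 40000;	/* group entity's current sum */
	long ge_running_sum = 25000;

	/* The estimate: assume all tasks are equally runnable. */
	long est = grq_load_sum / grq_load_weight;	/* = 30000 */

	/* Rule 3: detaching a task must not increase runnable_sum. */
	if (est > ge_runnable_sum)
		est = ge_runnable_sum;
	/* Rule 2: runnable_sum can't drop below running_sum. */
	if (est < ge_running_sum)
		est = ge_running_sum;
	/* Rule 1: clip at LOAD_AVG_MAX. */
	if (est > LOAD_AVG_MAX)
		est = LOAD_AVG_MAX;

	printf("estimated ge->avg.runnable_sum = %ld\n", est);
	return 0;
}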
>
> Cc: Yuyang Du <yuyang.du@...el.com>
> Cc: Ingo Molnar <mingo@...nel.org>
> Cc: Mike Galbraith <efault@....de>
> Cc: Chris Mason <clm@...com>
> Cc: Linus Torvalds <torvalds@...ux-foundation.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@....com>
> Cc: Josef Bacik <josef@...icpanda.com>
> Cc: Ben Segall <bsegall@...gle.com>
> Cc: Paul Turner <pjt@...gle.com>
> Cc: Tejun Heo <tj@...nel.org>
> Cc: Morten Rasmussen <morten.rasmussen@....com>
> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Link: http://lkml.kernel.org/r/20171019150442.GA25025@linaro.org
> ---
>
>  kernel/sched/fair.c |   99 ++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 70 insertions(+), 29 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3412,9 +3412,9 @@ void set_task_rq_fair(struct sched_entit
>   * _IFF_ we look at the pure running and runnable sums. Because they
>   * represent the very same entity, just at different points in the hierarchy.
>   *
> - *
> - * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> - * simply copies the running sum over.
> + * Per the above, update_tg_cfs_util() is trivial and simply copies the
> + * running sum over (but still wrong, because the group entity and group
> + * rq do not have their PELT windows aligned).
>   *
>   * However, update_tg_cfs_runnable() is more complex. So we have:
>   *
> @@ -3423,11 +3423,11 @@ void set_task_rq_fair(struct sched_entit
>   * And since, like util, the runnable part should be directly transferable,
>   * the following would _appear_ to be the straightforward approach:
>   *
> - *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg       (3)
> + *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg      (3)
>   *
>   * And per (1) we have:
>   *
> - *   ge->avg.running_avg == grq->avg.running_avg
> + *   ge->avg.runnable_avg == grq->avg.runnable_avg
>   *
>   * Which gives:
>   *
> @@ -3446,27 +3446,28 @@ void set_task_rq_fair(struct sched_entit
>   * to (shortly) return to us. This only works by keeping the weights as an
>   * integral part of the sum. We therefore cannot decompose as per (3).
>   *
> - * OK, so what then?
> + * Another reason this doesn't work is that runnable isn't a 0-sum entity.
> + * Imagine a rq with 2 tasks that each are runnable 2/3 of the time. Then the
> + * rq itself is runnable anywhere between 2/3 and 1, depending on how the
> + * runnable sections of these tasks overlap (or not). If they were to perfectly
> + * align, the rq as a whole would be runnable 2/3 of the time. If however we
> + * always have at least 1 runnable task, the rq as a whole is always runnable.
>   *
> + * So we'll have to approximate.. :/
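
A quick standalone check of these overlap bounds (window of 6 units,
each task runnable 4/6 = 2/3 of the time):

#include <stdio.h>

int main(void)
{
	int window = 6, task = 4;	/* each task runnable 2/3 of the time */

	/* Perfect overlap: both tasks runnable in the same 4 units, so
	 * the rq is runnable 4/6 = 2/3 of the time. */
	int min_runnable = task;

	/* Minimal overlap: 4 + 4 = 8 > 6, so the tasks must overlap by
	 * 2 units and the rq is runnable for the whole window. */
	int max_runnable = (2 * task > window) ? window : 2 * task;

	printf("rq runnable between %d/%d and %d/%d of the time\n",
	       min_runnable, window, max_runnable, window);
	return 0;
}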
>   *
> - * Another way to look at things is:
> + * Given the constraint:
>   *
> - *   grq->avg.load_avg = \Sum se->avg.load_avg
> + *   ge->avg.running_sum <= ge->avg.runnable_sum <= LOAD_AVG_MAX
>   *
> - * Therefore, per (2):
> + * We can construct a rule that adds runnable to a rq by assuming minimal
> + * overlap.
>   *
> - *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> + * On removal, we'll assume each task is equally runnable, which yields:
>   *
> - * And the very thing we're propagating is a change in that sum (someone
> - * joined/left). So we can easily know the runnable change, which would be, per
> - * (2) the already tracked se->load_avg divided by the corresponding
> - * se->weight.
> + *   grq->avg.runnable_sum = grq->avg.load_sum / grq->load.weight
>   *
> - * Basically (4) but in differential form:
> + * XXX: only do this for the part of runnable > running ?

That could be a possible improvement in how we estimate the runnable.
I'm going to run some trials to see the impact on the estimation.

>   *
> - *   d(runnable_avg) += se->avg.load_avg / se->load.weight
> - *                                                                (5)
> - *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
>   */
>
>  static inline void
> @@ -3478,6 +3479,14 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>         if (!delta)
>                 return;
>
> +       /*
> +        * The relation between sum and avg is:
> +        *
> +        *   LOAD_AVG_MAX - 1024 + sa->period_contrib
> +        *
> +        * however, the PELT windows are not aligned between grq and gse.
> +        */
> +
>         /* Set new sched_entity's utilization */
>         se->avg.util_avg = gcfs_rq->avg.util_avg;
>         se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
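
To make the comment above concrete: the exact relation is
sum == avg * (LOAD_AVG_MAX - 1024 + sa->period_contrib), so multiplying
by LOAD_AVG_MAX alone slightly over-estimates the sum. A standalone toy
(numbers illustrative only):

#include <stdio.h>

#define LOAD_AVG_MAX 47742

int main(void)
{
	unsigned long util_avg = 600;
	unsigned int period_contrib = 512;	/* 0..1023, position in window */

	unsigned long exact = util_avg * (LOAD_AVG_MAX - 1024 + period_contrib);
	unsigned long approx = util_avg * LOAD_AVG_MAX;	/* what the code above does */

	printf("exact=%lu approx=%lu overshoot=%lu\n",
	       exact, approx, approx - exact);
	return 0;
}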
> @@ -3490,33 +3499,65 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>  static inline void
>  update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
>  {
> -       long runnable_sum = gcfs_rq->prop_runnable_sum;
> -       long runnable_load_avg, load_avg;
> -       s64 runnable_load_sum, load_sum;
> +       long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
> +       unsigned long runnable_load_avg, load_avg;
> +       u64 runnable_load_sum, load_sum = 0;
> +       s64 delta_sum;
>
>         if (!runnable_sum)
>                 return;
>
>         gcfs_rq->prop_runnable_sum = 0;
>
> +       if (runnable_sum >= 0) {
> +               /*
> +                * Add runnable; clip at LOAD_AVG_MAX. Reflects that until
> +                * the CPU is saturated running == runnable.
> +                */
> +               runnable_sum += se->avg.load_sum;
> +               runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
> +       } else {
> +               /*
> +                * Estimate the departing task's runnable by assuming all tasks
> +                * are equally runnable.
> +                *
> +                * XXX: doesn't deal with multiple departures?

Why would this not deal with multiple departures?
We are using gcfs_rq->avg.load_sum, which reflects the new state of the
gcfs_rq, to evaluate the runnable_sum.

> +                */
> +               if (scale_load_down(gcfs_rq->load.weight)) {
> +                       load_sum = div_s64(gcfs_rq->avg.load_sum,
> +                               scale_load_down(gcfs_rq->load.weight));
> +               }
> +
> +               /* But make sure to not inflate se's runnable */
> +               runnable_sum = min(se->avg.load_sum, load_sum);
> +       }
> +
> +       /* runnable_sum can't be lower than running_sum */
> +       running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT; /* XXX ? */

running_sum is scaled by the CPU's capacity, but load_sum is not.

I took the shortcut of using SCHED_CAPACITY_SHIFT for the capacity, but
we might be better off using arch_scale_cpu_capacity(NULL, cpu) instead.
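
For reference, a sketch of that alternative in kernel context (untested,
and assuming the usual fair.c helpers rq_of()/cpu_of() are usable at
this point):

	/* util_sum is capacity-scaled but load_sum is not, so divide
	 * out the CPU's actual capacity rather than assuming
	 * SCHED_CAPACITY_SHIFT. */
	running_sum = se->avg.util_sum /
		      arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));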

> +       runnable_sum = max(runnable_sum, running_sum);
> +
>         load_sum = (s64)se_weight(se) * runnable_sum;
>         load_avg = div_s64(load_sum, LOAD_AVG_MAX);
>
> -       add_positive(&se->avg.load_sum, runnable_sum);
> -       add_positive(&se->avg.load_avg, load_avg);
> +       delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
> +       delta_avg = load_avg - se->avg.load_avg;
>
> -       add_positive(&cfs_rq->avg.load_avg, load_avg);
> -       add_positive(&cfs_rq->avg.load_sum, load_sum);
> +       se->avg.load_sum = runnable_sum;
> +       se->avg.load_avg = load_avg;
> +       add_positive(&cfs_rq->avg.load_avg, delta_avg);
> +       add_positive(&cfs_rq->avg.load_sum, delta_sum);
>
>         runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
>         runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
> +       delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
> +       delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
>
> -       add_positive(&se->avg.runnable_load_sum, runnable_sum);
> -       add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
> +       se->avg.runnable_load_sum = runnable_sum;
> +       se->avg.runnable_load_avg = runnable_load_avg;
>
>         if (se->on_rq) {
> -               add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
> -               add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
> +               add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
> +               add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
>         }
>  }
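
To make the delta-based updates at the end concrete: since the new
scheme sets se->avg.load_sum from the estimate instead of accumulating
a signed contribution, the parent cfs_rq has to see the difference
between the new and old weighted contributions. A standalone toy with
made-up numbers:

#include <stdio.h>

int main(void)
{
	long se_weight = 1024;
	long old_se_load_sum = 20000;	/* se's previous (unweighted) sum */
	long new_runnable_sum = 15000;	/* the clamped estimate */
	long cfs_rq_load_sum = 30000000L;	/* parent's weighted sum, incl. other entities */

	/* delta = new weighted contribution - old weighted contribution;
	 * it can be negative, hence the s64/add_positive in the patch. */
	long delta = se_weight * new_runnable_sum
		   - se_weight * old_se_load_sum;

	cfs_rq_load_sum += delta;
	printf("parent load_sum: %ld (delta %ld)\n", cfs_rq_load_sum, delta);
	return 0;
}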
