linux-kernel - Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

Message-ID: <CAKfTPtCw=Pu_BSHckGhNV9PuwMRkwaiCa8Tvz36_AaPh=LbCyQ@mail.gmail.com>
Date:   Mon, 9 Oct 2017 17:03:14 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Tejun Heo <tj@...nel.org>, Josef Bacik <josef@...icpanda.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Mike Galbraith <efault@....de>, Paul Turner <pjt@...gle.com>,
        Chris Mason <clm@...com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Ben Segall <bsegall@...gle.com>,
        Yuyang Du <yuyang.du@...el.com>
Subject: Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

Hi Peter,

On 1 September 2017 at 15:21, Peter Zijlstra <peterz@...radead.org> wrote:
> When an entity migrates in (or out) of a runqueue, we need to add (or
> remove) its contribution from the entire PELT hierarchy, because even
> non-runnable entities are included in the load average sums.
>
> In order to do this we have some propagation logic that updates the
> PELT tree, however the way it 'propagates' the runnable (or load)
> change is (more or less):
>
>                      tg->weight * grq->avg.load_avg
>   ge->avg.load_avg = ------------------------------
>                                tg->load_avg
>
> But that is the expression for ge->weight, and per the definition of
> load_avg:
>
>   ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
>
> That destroys the runnable_avg (by setting it to 1) we wanted to
> propagate.
>
> Instead directly propagate runnable_sum.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> ---
>  kernel/sched/debug.c |    2
>  kernel/sched/fair.c  |  186 ++++++++++++++++++++++++++++-----------------------
>  kernel/sched/sched.h |    9 +-
>  3 files changed, 112 insertions(+), 85 deletions(-)
>
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
>                         cfs_rq->removed.load_avg);
>         SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
>                         cfs_rq->removed.util_avg);
> +       SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
> +                       cfs_rq->removed.runnable_sum);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
>                         cfs_rq->tg_load_avg_contrib);
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
>         se->avg.last_update_time = n_last_update_time;
>  }
>
> -/* Take into account change of utilization of a child task group */
> +
> +/*
> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> + * propagate its contribution. The key to this propagation is the invariant
> + * that for each group:
> + *
> + *   ge->avg == grq->avg                                               (1)
> + *
> + * _IFF_ we look at the pure running and runnable sums. Because they
> + * represent the very same entity, just at different points in the hierarchy.

I agree for the running part, because only one entity can be running
at a time, but I'm not sure about the pure runnable sum: we can have
several runnable tasks in a cfs_rq but only one runnable group entity
to reflect them.
Or maybe I misunderstand (1).

As an example, take 2 always-running tasks TA and TB, so the
load_sum of each task is LOAD_AVG_MAX.
Then grq->avg.load_sum = \Sum se->avg.load_sum = 2*LOAD_AVG_MAX,
but ge->avg.load_sum will be only LOAD_AVG_MAX.

So if we directly apply d(TB->avg.load_sum) to the group hierarchy,
and to ge->avg.load_sum in particular, the latter decreases to 0
whereas it should decrease only by half.

I have been able to observe this wrong behavior with an rt-app JSON file.

So I think that we should instead remove only:

delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum

We don't have grq->avg.load_sum, but we can get a rough estimate with
grq->avg.load_avg / grq->weight
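
To put numbers on the TA/TB example, here is a small standalone C sketch
(not kernel code). proportional_delta() is a hypothetical helper name, and
the sketch assumes the exact grq->avg.load_sum is available, rather than
the rough estimate mentioned above. LOAD_AVG_MAX is the value defined in
the kernel's PELT code.

```c
#include <stdint.h>

/* Maximum PELT sum, reached by an always-running entity (kernel value). */
#define LOAD_AVG_MAX 47742

/*
 * Hypothetical helper, for illustration only: the part of the group
 * entity's load_sum to remove when a child entity leaves, scaled by the
 * child's share of the group runqueue's load_sum:
 *
 *   delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum
 *
 * The multiplication is done first to avoid losing precision in the
 * integer division.
 */
static uint64_t proportional_delta(uint64_t se_load_sum,
                                   uint64_t grq_load_sum,
                                   uint64_t ge_load_sum)
{
	if (grq_load_sum == 0)
		return 0;	/* empty group runqueue: nothing to remove */

	return se_load_sum * ge_load_sum / grq_load_sum;
}
```

With TA and TB always running, proportional_delta(LOAD_AVG_MAX,
2 * LOAD_AVG_MAX, LOAD_AVG_MAX) gives LOAD_AVG_MAX / 2 = 23871, so
ge->avg.load_sum drops by half when TB leaves. Subtracting
TB->avg.load_sum directly would instead take it to 0.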



> + *
> + *
> + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> + * simply copies the running sum over.
> + *
> + * However, update_tg_cfs_runnable() is more complex. So we have:
> + *
> + *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg         (2)
> + *
> + * And since, like util, the runnable part should be directly transferable,
> + * the following would _appear_ to be the straight forward approach:
> + *
> + *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg       (3)
> + *
> + * And per (1) we have:
> + *
> + *   ge->avg.running_avg == grq->avg.running_avg
> + *
> + * Which gives:
> + *
> + *                      ge->load.weight * grq->avg.load_avg
> + *   ge->avg.load_avg = -----------------------------------            (4)
> + *                               grq->load.weight
> + *
> + * Except that is wrong!
> + *
> + * Because while for entities historical weight is not important and we
> + * really only care about our future and therefore can consider a pure
> + * runnable sum, runqueues can NOT do this.
> + *
> + * We specifically want runqueues to have a load_avg that includes
> + * historical weights. Those represent the blocked load, the load we expect
> + * to (shortly) return to us. This only works by keeping the weights as
> + * integral part of the sum. We therefore cannot decompose as per (3).
> + *
> + * OK, so what then?
> + *
> + *
> + * Another way to look at things is:
> + *
> + *   grq->avg.load_avg = \Sum se->avg.load_avg
> + *
> + * Therefore, per (2):
> + *
> + *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> + *
> + * And the very thing we're propagating is a change in that sum (someone
> + * joined/left). So we can easily know the runnable change, which would be, per
> + * (2) the already tracked se->load_avg divided by the corresponding
> + * se->weight.
> + *
> + * Basically (4) but in differential form:
> + *
> + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
> + *                                                                (5)
> + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
> + */
> +
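
For reference, the differential form (5) quoted above can be written out
as a toy, standalone C sketch. The struct and function names are
hypothetical and use floating point for readability; this is not the
patch's update_tg_cfs_runnable(), only a literal reading of (5) for a
single entity joining.

```c
/* Toy stand-in for the fields used in (5); not the kernel's sched_entity. */
struct toy_entity {
	double weight;		/* plays the role of load.weight */
	double load_avg;	/* plays the role of avg.load_avg */
};

/* d(runnable_avg) = se->avg.load_avg / se->load.weight */
static double runnable_delta(const struct toy_entity *se)
{
	return se->load_avg / se->weight;
}

/* ge->avg.load_avg += ge->load.weight * d(runnable_avg) */
static void propagate_join(struct toy_entity *ge, const struct toy_entity *se)
{
	ge->load_avg += ge->weight * runnable_delta(se);
}
```

For example, an entity with weight 1024 and load_avg 512 contributes a
runnable change of 0.5, so a group entity with weight 512 sees its
load_avg grow by 256.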

[snip]