linux-kernel - Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

Message-ID: <CAKfTPtAZqVbMKkSaH8TNVinykR4-dhZuOLr9DOOGt_toPqzeuw@mail.gmail.com>
Date:   Mon, 9 Oct 2017 17:29:04 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Tejun Heo <tj@...nel.org>, Josef Bacik <josef@...icpanda.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Mike Galbraith <efault@....de>, Paul Turner <pjt@...gle.com>,
        Chris Mason <clm@...com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Ben Segall <bsegall@...gle.com>,
        Yuyang Du <yuyang.du@...el.com>
Subject: Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

On 9 October 2017 at 17:03, Vincent Guittot <vincent.guittot@...aro.org> wrote:
> Hi Peter,
>
> On 1 September 2017 at 15:21, Peter Zijlstra <peterz@...radead.org> wrote:
>> When an entity migrates in (or out) of a runqueue, we need to add (or
>> remove) its contribution from the entire PELT hierarchy, because even
>> non-runnable entities are included in the load average sums.
>>
>> In order to do this we have some propagation logic that updates the
>> PELT tree, however the way it 'propagates' the runnable (or load)
>> change is (more or less):
>>
>>                      tg->weight * grq->avg.load_avg
>>   ge->avg.load_avg = ------------------------------
>>                                tg->load_avg
>>
>> But that is the expression for ge->weight, and per the definition of
>> load_avg:
>>
>>   ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
>>
>> That destroys the runnable_avg (by setting it to 1) we wanted to
>> propagate.
>>
>> Instead directly propagate runnable_sum.
>>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>> ---
>>  kernel/sched/debug.c |    2
>>  kernel/sched/fair.c  |  186 ++++++++++++++++++++++++++++-----------------------
>>  kernel/sched/sched.h |    9 +-
>>  3 files changed, 112 insertions(+), 85 deletions(-)
>>
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
>>                         cfs_rq->removed.load_avg);
>>         SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
>>                         cfs_rq->removed.util_avg);
>> +       SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
>> +                       cfs_rq->removed.runnable_sum);
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>         SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
>>                         cfs_rq->tg_load_avg_contrib);
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
>>         se->avg.last_update_time = n_last_update_time;
>>  }
>>
>> -/* Take into account change of utilization of a child task group */
>> +
>> +/*
>> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
>> + * propagate its contribution. The key to this propagation is the invariant
>> + * that for each group:
>> + *
>> + *   ge->avg == grq->avg                                               (1)
>> + *
>> + * _IFF_ we look at the pure running and runnable sums. Because they
>> + * represent the very same entity, just at different points in the hierarchy.
>
> I agree for the running part, because only one entity can be running,
> but I'm not sure about the pure runnable sum: we can have several
> runnable tasks in a cfs_rq but only one runnable group entity to
> reflect them. Or am I misunderstanding (1)?
>
> As an example, take 2 always-running tasks, TA and TB, so that the
> load_sum of each task is LOAD_AVG_MAX. Then
>
>   grq->avg.load_sum = \Sum se->avg.load_sum = 2*LOAD_AVG_MAX
>
> but ge->avg.load_sum will be only LOAD_AVG_MAX.
>
> So if we apply d(TB->avg.load_sum) directly to the group hierarchy,
> and to ge->avg.load_sum in particular, the latter decreases to 0
> whereas it should decrease only by half.
>
> I have been able to reproduce this wrong behavior with an rt-app JSON file
>
> so I think that we should instead remove only
>
> delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum

delta = se->avg.load_sum / (grq->avg.load_sum + se->avg.load_sum) * ge->avg.load_sum

as the se has already been detached, so grq->avg.load_sum no longer
includes it.
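
To make the arithmetic of the TA/TB example concrete, here is a minimal
userspace sketch in C (not kernel code). The helper name
proportional_delta() and its parameter names are hypothetical, and
LOAD_AVG_MAX is hard-coded to 47742 purely for illustration. It assumes the
grq sum passed in no longer contains the departing se.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constant standing in for the kernel's LOAD_AVG_MAX. */
#define LOAD_AVG_MAX 47742ULL

/*
 * Hypothetical helper: the share of ge_load_sum attributable to a
 * departing se, where grq_load_sum_after_detach is the group runqueue
 * sum with the se already removed.
 */
static uint64_t proportional_delta(uint64_t se_load_sum,
                                   uint64_t grq_load_sum_after_detach,
                                   uint64_t ge_load_sum)
{
    /* Add the se back to recover the sum before the detach. */
    uint64_t total = grq_load_sum_after_detach + se_load_sum;

    assert(total != 0);

    /* Multiply first so the integer division loses less precision. */
    return se_load_sum * ge_load_sum / total;
}
```

With TB leaving and TA remaining, the inputs are LOAD_AVG_MAX for the se,
LOAD_AVG_MAX for the remaining grq sum and LOAD_AVG_MAX for the ge, so the
delta is half of ge->avg.load_sum rather than all of it.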

> We don't have grq->avg.load_sum but we can get a rough estimate from
> grq->avg.load_avg/grq->weight
>
>
>
>> + *
>> + *
>> + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
>> + * simply copies the running sum over.
>> + *
>> + * However, update_tg_cfs_runnable() is more complex. So we have:
>> + *
>> + *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg         (2)
>> + *
>> + * And since, like util, the runnable part should be directly transferable,
>> + * the following would _appear_ to be the straight forward approach:
>> + *
>> + *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg       (3)
>> + *
>> + * And per (1) we have:
>> + *
>> + *   ge->avg.running_avg == grq->avg.running_avg
>> + *
>> + * Which gives:
>> + *
>> + *                      ge->load.weight * grq->avg.load_avg
>> + *   ge->avg.load_avg = -----------------------------------            (4)
>> + *                               grq->load.weight
>> + *
>> + * Except that is wrong!
>> + *
>> + * Because while for entities historical weight is not important and we
>> + * really only care about our future and therefore can consider a pure
>> + * runnable sum, runqueues can NOT do this.
>> + *
>> + * We specifically want runqueues to have a load_avg that includes
>> + * historical weights. Those represent the blocked load, the load we expect
>> + * to (shortly) return to us. This only works by keeping the weights as
>> + * integral part of the sum. We therefore cannot decompose as per (3).
>> + *
>> + * OK, so what then?
>> + *
>> + *
>> + * Another way to look at things is:
>> + *
>> + *   grq->avg.load_avg = \Sum se->avg.load_avg
>> + *
>> + * Therefore, per (2):
>> + *
>> + *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
>> + *
>> + * And the very thing we're propagating is a change in that sum (someone
>> + * joined/left). So we can easily know the runnable change, which would be, per
>> + * (2) the already tracked se->load_avg divided by the corresponding
>> + * se->weight.
>> + *
>> + * Basically (4) but in differential form:
>> + *
>> + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
>> + *                                                                (5)
>> + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
>> + */
>> +
>
> [snip]
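
For reference, the differential form (5) from the quoted comment can be
sketched in userspace C as below. This is only an illustration under stated
assumptions: struct toy_entity and toy_propagate() are hypothetical names,
doubles are used instead of the kernel's fixed-point arithmetic, and the
snipped part of the patch may well do this differently.

```c
#include <assert.h>

/* Hypothetical stand-in for the few fields used here; not a kernel struct. */
struct toy_entity {
    double weight;   /* plays the role of load.weight */
    double load_avg; /* plays the role of avg.load_avg */
};

/*
 * Apply (5): turn the migrating se's load_avg into a runnable_avg change,
 * then scale it by the group entity's weight.
 * sign is +1 when the se joins and -1 when it leaves.
 */
static void toy_propagate(struct toy_entity *ge,
                          const struct toy_entity *se, int sign)
{
    assert(se->weight > 0.0);

    /* d(runnable_avg) = se->avg.load_avg / se->load.weight */
    double d_runnable_avg = sign * (se->load_avg / se->weight);

    /* ge->avg.load_avg += ge->load.weight * d(runnable_avg) */
    ge->load_avg += ge->weight * d_runnable_avg;
}
```

For example, an se with weight 1024 and load_avg 512 contributes a
runnable_avg change of 0.5, so a ge of weight 2048 gains 1024 in load_avg
when the se joins and loses the same amount when it leaves.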