linux-kernel - Re: [PATCH 2/2] sched/fair: util_est: add running

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 5 Jun 2018 16:11:29 +0100
From:   Patrick Bellasi <patrick.bellasi@....com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        "open list:THERMAL" <linux-pm@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Steve Muckle <smuckle@...gle.com>, Todd Kjos <tkjos@...gle.com>
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

Hi Vincent,

On 05-Jun 08:57, Vincent Guittot wrote:
> On 4 June 2018 at 18:06, Patrick Bellasi <patrick.bellasi@....com> wrote:

[...]

> > Let's improve the estimated utilization by adding a new "sort-of" PELT
> > signal, explicitly only for SE which has the following behavior:
> >  a) at each enqueue time of a task, its value is the (already decayed)
> >     util_avg of the task being enqueued
> >  b) it's updated at each update_load_avg
> >  c) it can just increase, whenever the task is actually RUNNING on a
> >     CPU, while it's kept stable while the task is RUNNANBLE but not
> >     actively consuming CPU bandwidth
> >
> > Such a defined signal is exactly equivalent to the util_avg for a task
> > running alone on a CPU while, in case the task is preempted, it allows
> > to know at dequeue time how much would have been the task utilization if
> > it was running alone on that CPU.
> 
> I don't agree with this statement above
> Let say that you have 2 periodic tasks which wants to run 4ms in every
> period of 10ms and wakes up at the same time.
> One task will execute 1st and the other will wait but at the end they
> have the same period and running time and as a result the same
> util_avg which is exactly what they need.

In your example above you say that both tasks will end up with a 40%
util_avg, and that's exactly also the value reported by the new
"running_avg" metric.

Both tasks, if they where running alone, they would have generated a
40% utilization, and above I'm saying:

        it allows to know at dequeue time
   how much would have been the task utilization
      if it it was running alone on that CPU

Don't see where is not correct, maybe I should explain it better?

> > This new signal is named "running_avg", since it tracks the actual
> > RUNNING time of a task by ignoring any form of preemption.
> 
> In fact, you still take into account this preemption as you remove
> some time whereas it's only a shift of the excution

When the task is enqueued we cache the (already decayed) util_avg, and
from this time on the running_avg can only increase. It increases only
for the portion of CPU time the task is running and it's never decayed
while the task is preempted.

In your example above, if we look at the second task running, the one
"delayed", you will have:

 @t1 wakeup time:    running_avg@t1 = util_avg@t1
                        since we initialize it with the
                        "sleep decayed" util_avg

  NOTE: this initialization is the important part, more on that on your
  next comment below related to the period being changed.

 @t2 switch_in time: running_avg@t2 = running_avg@t1
                        since it's not decayed while RUNNUBLE but !RUNNING

                     while, meanwhile:

                     util_avg@t2 < util_avg@t1
                        since it's decayed while RUNNABLE but !RUNNING

 @t3 dequeue time:   running_avg@t3 > running_avg@t2


When you say "it's only a shift of the execution" I agree, and indeed
the running_avg is not affected at all.

So, we can say that I'm "accounting for preemption time" but really,
the way I do it, is just by _not decaying_ PELT when the task is:

                 RUNNABLE but !RUNNING

and that's why I say that running_avg gives you the same behavior of
util_avg _if_ the task was running alone, because in that case you never
have the condition above, and thus util_avg is never decayed.

[...]

> > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> >                 sa->runnable_load_sum =
> >                         decay_load(sa->runnable_load_sum, periods);
> >                 sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > +               if (running)
> > +                       sa->running_sum = decay_load(sa->running_sum, periods);
> 
> so you make some time disappearing from the equation as the signal
> will not be decayed and make the period looking shorter than reality.

Since at enqueue time we always initialize running_avg to whatever is
util_avg, I don't think we are messing up with the period at all.

The util_avg signal properly accounts for the period.
Thus, the combined effect of:

  - initializing running_avg at enqueue time with the value of
    util_avg, already decayed to properly account for the task period
  - not decaying running_avg when the task is RUNNABLE but !RUNNING

should just result into "virtually" considering the two tasks of your
example "as if" they was running concurrently on two different CPUs.

Isn't it?

> With the example I mentioned above, the 2nd task will be seen as if
> its period becomes 6ms instead of 10ms which is not true and the
> utilization of the task is overestimated without any good reason

I don't see that overestimation... and I cannot even measure it.

If I run an experiment with your example above, while using the
performance governor to rule out any possible scale invariance
difference, here is what I measure:

   Task1 (40ms delayed by the following Task2):
                               mean          std     max
      running_avg        455.387449    22.940168   492.0
      util_avg           433.233288    17.395477   458.0

   Task2 (waking up at same time of Task1 and running before):
                               mean          std     max
      running_avg        430.281834    22.405175   455.0
      util_avg           421.745331    22.098873   456.0

and if I compare Task1 above with another experiment where Task1 is
running alone:

   Task1 (running alone):
                               mean          std     min
      running_avg        460.257895    22.103704   460.0
      util_avg           435.119737    17.647556   461.0


Thus, there are some variations ok, but I think they are more due to
the rounding errors we have.
Task1 is still seen as a 455 expected utilization, which is not the
4/6 ~= 660 you say above if it should be accounted as a task running
4ms every 6ms.

> Furthermore, this new signal doesn't have clear meaning because it's
> not utilization but it's also not the waiting time of the task.

The meaning is: the utilization of the task _if_ it should be running
alone on that CPU. Tehe numbers above shows that since
    455 (Task1 delayed) ~= 460 (Task1 running alone)

If it's running alone it would be exactly util_avg (minus rounding
errors), but if it's preempted it will be a better representation of
the task's needed CPU bandwidth.

Unless we really need to track the "waiting time of a task" I would
say that the signal above should make util_est much more accurate on
knowing, at wakeup time, how much CPU a task will likely need...
independently from preemption sources.

Here are some other experiments / workloads I'm testing:

   https://gist.github.com/derkling/fe01e5ddf639e18b5d85a3b1d2e84b7d

> >
> >                 /*
> >                  * Step 2
> > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> >                 sa->load_sum += load * contrib;
> >         if (runnable)
> >                 sa->runnable_load_sum += runnable * contrib;
> > -       if (running)
> > +       if (running) {
> >                 sa->util_sum += contrib * scale_cpu;
> > +               sa->running_sum += contrib * scale_cpu;
> > +       }
> >
> >         return periods;
> >  }
> > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> >         WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> >  }
> >
> > +static inline void util_est_enqueue_running(struct task_struct *p)
> > +{
> > +       /* Initilize the (non-preempted) utilization */
> > +       p->se.avg.running_sum = p->se.avg.util_sum;

This line above is what should ensure we do not mess up with the task
period.

> > +}
> > +
> >  /*
> >   * Check if a (signed) value is within a specified (unsigned) margin,
> >   * based on the observation that:
> > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> >          * Skip update of task's estimated utilization when its EWMA is
> >          * already ~1% close to its last activation value.
> >          */
> > -       ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > +       ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> >         last_ewma_diff = ue.enqueued - ue.ewma;
> >         if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> >                 return;
> > @@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >         if (!se)
> >                 add_nr_running(rq, 1);
> >
> > +       util_est_enqueue_running(p);
> > +
> >         hrtick_update(rq);
> >  }
> >
> > --
> > 2.15.1
> >

-- 
#include <best/regards.h>

Patrick Bellasi