linux-kernel - Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

Message-ID: <CAKfTPtC-M70u+gK2bEd=yQ8pdG1A3Opm-U5f=K-Hxhc6OXgM=w@mail.gmail.com>
Date:   Fri, 11 Jan 2019 15:29:48 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Patrick Bellasi <patrick.bellasi@....com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
        Thara Gopinath <thara.gopinath@...aro.org>,
        pkondeti@...eaurora.org, Quentin Perret <quentin.perret@....com>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>
Subject: Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT

On Thu, 10 Jan 2019 at 16:30, Patrick Bellasi <patrick.bellasi@....com> wrote:
>
> On 29-Nov 17:19, Vincent Guittot wrote:
> > On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi <patrick.bellasi@....com> wrote:
> > > On 29-Nov 11:43, Vincent Guittot wrote:
>
> [...]
>
> > > Seems we agree that, when there is no idle time:
> > > - the two 15% tasks will be overestimated
> > > - their utilization will reach 50% after a while
> > >
> > > If I'm not wrong, we will have:
> > > - 30% CPU util in  ~16ms @1024 capacity
> > >                    ~64ms  @256 capacity
> > >
> > > Thus, the tasks will be certainly over-estimated after ~64ms.
> > > Is that correct ?
> >
> > From a pure util_avg point of view, that's correct.
> > But I'd like to weigh that against the example below.
> >
> > > Now, we can argue that 64ms is a pretty long time and thus it's quite
> > > unlikely that we will have no idle time for such a long time.
> > >
> > > Still, I'm wondering if we should keep collecting those samples or
> > > better find a way to detect that and skip the sampling.
> >
> > The problem is that you can have util_avg above capacity even with idle time.
> > In the 1st example of this thread, the 39ms/80ms task will reach 709,
> > which is the value saved by util_est on a big core.
> > But on a core with half capacity, there is still idle time, so 709 is a
> > correct value although it is above 512.
>
> Right, I see your point and (in principle) I like the idea of
> collecting samples for tasks which happen to run at a lower capacity
> than required and the utilization value makes sense...
>
> > In fact, the max will always be above the linear ratio because it's based
> > on a geometric series.
> >
> > And this is true even with a 15.6ms/32ms task (same ratio as above),
> > although the impact is smaller (the max value, which should be saved by
> > util_est, becomes 587 in this case).
>
> However that's not always the case... as per my example above.
>
> Moreover, we should also consider that util_est is mainly meant to be
> a lower bound for a task's utilization.
> That's why task_util_est() already returns the actual util_avg when
> it's higher than the estimated utilization.

I can imagine that using max(util_avg, util_est) helps the scheduler
keep a correct utilization value when util_avg goes above the CPU
capacity while there is still idle time.
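
As a rough cross-check of the 709 and 587 figures quoted above, here is a minimal user-space sketch of the steady-state peak of a periodic task's util_avg. It is an approximation under stated assumptions, not the kernel's PELT code: time is treated as continuous instead of being split into 1024us segments, the 32ms half-life is the usual PELT value, and all helper names (exp2_neg, pelt_decay, pelt_peak_util, linear_util) are made up for illustration.

```c
/*
 * Continuous-time approximation of the PELT peak for a periodic task.
 * Assumptions (not taken from the kernel source): time is continuous
 * rather than accounted in 1024us segments, and every name below is
 * illustrative only.
 */

#define PELT_HALFLIFE_MS 32.0   /* a contribution halves every 32ms */
#define PELT_MAX_UTIL    1024.0 /* same scale as SCHED_CAPACITY_SCALE */

/*
 * 2^(-x) for small x >= 0, via the Taylor series of exp(-x * ln 2).
 * Written out by hand so the sketch builds without linking libm.
 */
static double exp2_neg(double x)
{
	const double ln2 = 0.6931471805599453;
	double z = -x * ln2;
	double term = 1.0;
	double sum = 1.0;
	int i;

	for (i = 1; i < 60; i++) {
		term *= z / i;
		sum += term;
	}
	return sum;
}

/* Decay factor applied to a contribution that is 'ms' milliseconds old. */
static double pelt_decay(double ms)
{
	return exp2_neg(ms / PELT_HALFLIFE_MS);
}

/*
 * Peak utilization at the end of each running phase once the signal is
 * in steady state. One full period maps the peak u to
 *   u * decay(period) + MAX * (1 - decay(run)),
 * and solving for the fixed point gives the geometric-series result.
 */
static double pelt_peak_util(double run_ms, double period_ms)
{
	return PELT_MAX_UTIL * (1.0 - pelt_decay(run_ms)) /
	       (1.0 - pelt_decay(period_ms));
}

/* Plain duty-cycle ratio on the same scale, for comparison. */
static double linear_util(double run_ms, double period_ms)
{
	return PELT_MAX_UTIL * run_ms / period_ms;
}
```

With this model, pelt_peak_util(39.0, 80.0) comes out at about 709 and pelt_peak_util(15.6, 32.0) at about 587, while linear_util() gives about 499 for both, since the two tasks share the same ratio.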

>
> With your new signal and without any special check on samples
> collection, if a task is limited because of thermal capping for
> example, we could end up overestimating its utilization and thus
> perhaps generating an unwanted frequency spike when the capping is
> relaxed... and (even worse) it will take some more activations for the
> estimated utilization to converge back to the actual utilization.
>
> Since we cannot easily know if there is idle time in a CPU when a task
> completes an activation with a utilization higher than the CPU
> capacity, I would prefer to just skip the sampling with
> something like:
>
> ---8<---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9332863d122a..485053026533 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3639,6 +3639,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
>  {
>         long last_ewma_diff;
>         struct util_est ue;
> +       int cpu;
>
>         if (!sched_feat(UTIL_EST))
>                 return;
> @@ -3672,6 +3673,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
>         if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
>                 return;
>
> +       /*
> +        * To avoid overestimation of actual task utilization, skip updates if
> +        * we cannot guarantee there is idle time on this CPU.
> +        */
> +       cpu = cpu_of(rq_of(cfs_rq));
> +       if (task_util(p) > cpu_capacity(cpu))
> +               return;
> +
>         /*
>          * Update Task's estimated utilization
>          *
> ---8<---
>
> At least this will ensure that util_est always provides an actual
> measured lower bound for a task's utilization.
>
> If you think this makes sense, feel free to add such a patch on
> top of your series.

OK, I'll add it when rebasing the series.
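
For reference, the effect of the proposed check can be modelled in user space as follows. This is only a sketch under assumptions: struct util_est_model and both helpers are hypothetical stand-ins rather than the kernel's types or functions, the within_margin() filter from the patch context is left out, and the 1/4 EWMA weight is assumed here for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-task estimate; not the kernel's struct. */
struct util_est_model {
	unsigned int enqueued;	/* util_avg sampled at the last dequeue */
	unsigned int ewma;	/* moving average of those samples */
};

static unsigned int max_uint(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * Value seen by the scheduler: the estimate acts as a lower bound, so
 * the current util_avg is returned whenever it is higher.
 */
static unsigned int model_task_util_est(unsigned int util_avg,
					const struct util_est_model *ue)
{
	return max_uint(util_avg, max_uint(ue->enqueued, ue->ewma));
}

/*
 * Sample collection at dequeue. Mirrors the proposed check: a sample
 * above the CPU capacity is dropped, because idle time on that CPU
 * cannot be guaranteed. Returns true if the sample was recorded.
 */
static bool model_util_est_update(struct util_est_model *ue,
				  unsigned int util_avg,
				  unsigned int cpu_capacity)
{
	int diff;

	if (util_avg > cpu_capacity)
		return false;

	ue->enqueued = util_avg;
	/* Assumed weight of 1/4 for the new sample. */
	diff = (int)ue->enqueued - (int)ue->ewma;
	ue->ewma = (unsigned int)((int)ue->ewma + diff / 4);
	return true;
}
```

In this model a sample of 709 on a CPU of capacity 512 leaves the stored estimate untouched, while model_task_util_est() still reports 709 for as long as util_avg stays at that level.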

Thanks
Vincent
>
> Cheers Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi