linux-kernel - Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtDTT7JASWnwJwwC9WD25Xqvf1yfiBRaQ0SU6xx9owiaMA@mail.gmail.com>
Date:   Wed, 24 Oct 2018 11:07:24 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     pkondeti@...eaurora.org
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Patrick Bellasi <patrick.bellasi@....com>,
        Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
        Thara Gopinath <thara.gopinath@...aro.org>
Subject: Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT

Hi  Pavan,

On Wed, 24 Oct 2018 at 06:53, Pavan Kondeti <pkondeti@...eaurora.org> wrote:
>
> Hi Vincent,
>
> Thanks for the detailed explanation.
>
> On Tue, Oct 23, 2018 at 02:15:08PM +0200, Vincent Guittot wrote:
> > Hi Pavan,
> >
> > On Tue, 23 Oct 2018 at 07:59, Pavan Kondeti <pkondeti@...eaurora.org> wrote:
> > >
> > > Hi Vincent,
> > >
> > > On Fri, Oct 19, 2018 at 06:17:51PM +0200, Vincent Guittot wrote:
> > > >
> > > >  /*
> > > > + * The clock_pelt scales the time to reflect the effective amount of
> > > > + * computation done during the running delta time but then sync back to
> > > > + * clock_task when rq is idle.
> > > > + *
> > > > + *
> > > > + * absolute time   | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
> > > > + * @ max capacity  ------******---------------******---------------
> > > > + * @ half capacity ------************---------************---------
> > > > + * clock pelt      | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16
> > > > + *
> > > > + */
> > > > +void update_rq_clock_pelt(struct rq *rq, s64 delta)
> > > > +{
> > > > +
> > > > +     if (is_idle_task(rq->curr)) {
> > > > +             u32 divider = (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> > > > +             u32 overload = rq->cfs.avg.util_sum + LOAD_AVG_MAX;
> > > > +             overload += rq->avg_rt.util_sum;
> > > > +             overload += rq->avg_dl.util_sum;
> > > > +
> > > > +             /*
> > > > +              * Reflecting some stolen time makes sense only if the idle
> > > > +              * phase would be present at max capacity. As soon as the
> > > > +              * utilization of a rq has reached the maximum value, it is
> > > > +              * considered as an always runnnig rq without idle time to
> > > > +              * steal. This potential idle time is considered as lost in
> > > > +              * this case. We keep track of this lost idle time compare to
> > > > +              * rq's clock_task.
> > > > +              */
> > > > +             if (overload >= divider)
> > > > +                     rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> > > > +
> > >
> > > I am trying to understand this better. I believe we run into this scenario, when
> > > the frequency is limited due to thermal/userspace constraints. Lets say
> >
> > Yes these are the most common UCs but this can also happen after tasks
> > migration or with a cpufreq governor that doesn't increase OPP fast
> > enough for current utilization.
> >
> > > frequency is limited to Fmax/2. A 50% task at Fmax, becomes 100% running at
> > > Fmax/2. The utilization is built up to 100% after several periods.
> > > The clock_pelt runs at 1/2 speed of the clock_task. We are loosing the idle time
> > > all along. What happens when the CPU enters idle for a short duration and comes
> > > back to run this 100% utilization task?
> >
> > If you are at 100%, we only apply the short idle duration
> >
> > >
> > > If the above block is not present i.e lost_idle_time is not tracked, we
> > > stretch the idle time (since clock_pelt is synced to clock_task) and the
> > > utilization is dropped. Right?
> >
> > yes that 's what would happen. I gives more details below
> >
> > >
> > > With the above block, we don't stretch the idle time. In fact we don't
> > > consider the idle time at all. Because,
> > >
> > > idle_time = now - last_time;
> > >
> > > idle_time = (rq->clock_pelt - rq->lost_idle_time) - last_time
> > > idle_time = (rq->clock_task - rq_clock_task + rq->clock_pelt_old) - last_time
> > > idle_time = rq->clock_pelt_old - last_time
> > >
> > > The last time is nothing but the last snapshot of the rq->clock_pelt when the
> > > task entered sleep due to which CPU entered idle.
> >
> > The condition for dropping this idle time is quite important. This
> > only happens when the utilization reaches max compute capacity of the
> > CPU. Otherwise, the idle time will be fully applied
>
> Right.
>
> rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt
>
> This not only tracks the lost idle time due to running slow but also the
> absolute/real sleep time. For example, when the slow running 100% task
> sleeps for 100 msec, are not we ignoring the 100 msec sleep there?
>
> For example a task ran 323 msec at full capacity and sleeps for (1000-323)
> msec. when it wakes up the utilization is dropped. If the same task runs
> for 626 msec at the half capacity and sleeps for (1000-626), should not
> drop the utilization by taking (1000-626) sleep time into account. I
> understand that why we don't strech idle time to (1000-323) but it is not
> clear to me why we completely drop the idle time.

So this should not happen.
I' m going to update the way I track lost idle time  and move it out
of update_rq_clock_pelt() and only do the test when entering idle
This is even better as it simplifies update_rq_clock_pelt() and
reduces the number of tests for lost idle time

Thanks for spotting this

I'm preparing a new version with this, some build fix for !SMP and the
alignement with cache line suggested by Peter

Vincent
>
> >
> > >