linux-kernel - Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT

Date:   Wed, 24 Oct 2018 10:23:05 +0530
From:   Pavan Kondeti <pkondeti@...eaurora.org>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Patrick Bellasi <patrick.bellasi@....com>,
        Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
        Thara Gopinath <thara.gopinath@...aro.org>
Subject: Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT

Hi Vincent,

Thanks for the detailed explanation.

On Tue, Oct 23, 2018 at 02:15:08PM +0200, Vincent Guittot wrote:
> Hi Pavan,
> 
> On Tue, 23 Oct 2018 at 07:59, Pavan Kondeti <pkondeti@...eaurora.org> wrote:
> >
> > Hi Vincent,
> >
> > On Fri, Oct 19, 2018 at 06:17:51PM +0200, Vincent Guittot wrote:
> > >
> > >  /*
> > > + * The clock_pelt scales the time to reflect the effective amount of
> > > + * computation done during the running delta time but then sync back to
> > > + * clock_task when rq is idle.
> > > + *
> > > + *
> > > + * absolute time   | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
> > > + * @ max capacity  ------******---------------******---------------
> > > + * @ half capacity ------************---------************---------
> > > + * clock pelt      | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16
> > > + *
> > > + */
> > > +void update_rq_clock_pelt(struct rq *rq, s64 delta)
> > > +{
> > > +
> > > +     if (is_idle_task(rq->curr)) {
> > > +             u32 divider = (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> > > +             u32 overload = rq->cfs.avg.util_sum + LOAD_AVG_MAX;
> > > +             overload += rq->avg_rt.util_sum;
> > > +             overload += rq->avg_dl.util_sum;
> > > +
> > > +             /*
> > > +              * Reflecting some stolen time makes sense only if the idle
> > > +              * phase would be present at max capacity. As soon as the
> > > +              * utilization of a rq has reached the maximum value, it is
> > > +              * considered as an always running rq without idle time to
> > > +              * steal. This potential idle time is considered as lost in
> > > +              * this case. We keep track of this lost idle time compared to
> > > +              * rq's clock_task.
> > > +              */
> > > +             if (overload >= divider)
> > > +                     rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
> > > +
> >
> > I am trying to understand this better. I believe we run into this scenario, when
> > the frequency is limited due to thermal/userspace constraints. Let's say
> 
> Yes these are the most common UCs but this can also happen after tasks
> migration or with a cpufreq governor that doesn't increase OPP fast
> enough for current utilization.
> 
> > frequency is limited to Fmax/2. A 50% task at Fmax, becomes 100% running at
> > Fmax/2. The utilization is built up to 100% after several periods.
> > The clock_pelt runs at 1/2 speed of the clock_task. We are losing the idle time
> > all along. What happens when the CPU enters idle for a short duration and comes
> > back to run this 100% utilization task?
> 
> If you are at 100%, we only apply the short idle duration
> 
> >
> > If the above block is not present i.e lost_idle_time is not tracked, we
> > stretch the idle time (since clock_pelt is synced to clock_task) and the
> > utilization is dropped. Right?
> 
> Yes, that's what would happen. I give more details below.
> 
> >
> > With the above block, we don't stretch the idle time. In fact we don't
> > consider the idle time at all. Because,
> >
> > idle_time = now - last_time;
> >
> > idle_time = (rq->clock_pelt - rq->lost_idle_time) - last_time
> > idle_time = (rq->clock_task - rq_clock_task + rq->clock_pelt_old) - last_time
> > idle_time = rq->clock_pelt_old - last_time
> >
> > The last time is nothing but the last snapshot of the rq->clock_pelt when the
> > task entered sleep due to which CPU entered idle.
> 
> The condition for dropping this idle time is quite important. This
> only happens when the utilization reaches max compute capacity of the
> CPU. Otherwise, the idle time will be fully applied

Right.

rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt

This tracks not only the idle time lost due to running slowly but also the
absolute/real sleep time. For example, when the slow-running 100% task
sleeps for 100 msec, aren't we ignoring that 100 msec sleep?

For example, a task runs for 323 msec at full capacity and then sleeps for
(1000-323) msec. When it wakes up, its utilization has dropped. If the same
task runs for 626 msec at half capacity and sleeps for (1000-626) msec,
shouldn't we drop the utilization by taking the (1000-626) msec sleep time
into account? I understand why we don't stretch the idle time to (1000-323)
msec, but it is not clear to me why we drop the idle time completely.
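
The arithmetic behind this question can be made concrete with a toy
user-space model. This is only a sketch of the quoted hunk as read above,
not kernel code: struct toy_rq and the toy_* helpers are hypothetical names,
times are in msec, the "saturated" flag stands in for the
"overload >= divider" test, and the PELT time base is assumed to be
clock_pelt - lost_idle_time, as in the derivation quoted earlier.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the three rq fields the hunk touches (values in msec). */
struct toy_rq {
	uint64_t clock_task;      /* real task time */
	uint64_t clock_pelt;      /* scaled time, synced to clock_task at idle */
	uint64_t lost_idle_time;  /* idle time dropped while saturated */
};

/* Running: clock_task advances by delta, clock_pelt by the scaled delta. */
static void toy_run(struct toy_rq *rq, uint64_t delta, uint64_t cap)
{
	rq->clock_task += delta;
	rq->clock_pelt += (delta * cap) >> 10;  /* like cap_scale(), 1024 == max */
}

/* Idle update: mirrors the is_idle_task() branch of the quoted hunk. */
static void toy_idle(struct toy_rq *rq, uint64_t delta, bool saturated)
{
	rq->clock_task += delta;
	if (saturated)  /* assumption: stands in for "overload >= divider" */
		rq->lost_idle_time += rq->clock_task - rq->clock_pelt;
	rq->clock_pelt = rq->clock_task;
}

/* Assumed time base for PELT updates: clock_pelt - lost_idle_time. */
static uint64_t toy_pelt_now(const struct toy_rq *rq)
{
	return rq->clock_pelt - rq->lost_idle_time;
}

/* Idle time PELT would see after one run phase followed by one sleep. */
static uint64_t toy_idle_seen(uint64_t run, uint64_t cap, uint64_t sleep,
			      bool saturated)
{
	struct toy_rq rq = { 0, 0, 0 };
	uint64_t last;

	toy_run(&rq, run, cap);
	last = toy_pelt_now(&rq);
	toy_idle(&rq, sleep, saturated);
	return toy_pelt_now(&rq) - last;
}
```

In this model, toy_idle_seen(626, 512, 374, true) is 0: the whole 374 msec
sleep ends up in lost_idle_time. By contrast,
toy_idle_seen(240, 512, 90, false) is 210, i.e. the 90 msec of real idle is
stretched to the 210 msec that would exist at max capacity.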

> 
> >
> > Can you please explain the significance of the above block with an example?
> 
> The pelt signal reaches its max value after 323ms at full capacity,
> which means that we can't make any difference between tasks running
> 323ms, 500ms or more at max capacity. As a result, we consider that
> the CPU is fully used and there is no idle time when the utilization
> equals max capacity. If CPU runs at half the capacity, it will run
> 626ms before reaching max utilization and at that time we will stop to
> stretch the idle time because we consider that there is no idle time
> to stretch. By default, we never drop the idle time, which is
> necessary for being fully invariant, and we always apply it. But we
> have to drop it when we consider that it would not have been present
> at max capacity either. That's the whole purpose of the block that you
> mention.

This is very clear.

> 
> Let's take a task that runs 120 ms with a period of 330ms.
> At max capacity, task utilization will vary in the range [10-949]
> At half capacity, task will run 240ms and the range will stay the same
> as the idle time and the running time will be the same once stretched
> and scaled
> At one third of the capacity, task should run 360ms in a period of 330
> which means that the task will always run and will probably even lose
> some events, as it will not have finished when the new period
> starts. In this case, the task/CPU utilization will reach the max value
> just like an always running task. As we can't tell the difference
> anymore, we consider that there is no idle time to recover once the
> cpu becomes idle, and the block of code that you mention above will
> cancel the stretch of idle time.
> 

Got it.
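
The numbers in the example above can be checked with a small user-space
sketch. It is a floating-point approximation and not the kernel's
fixed-point PELT code: it assumes one decay step per 1 ms (the kernel uses
1024 us periods) and a decay factor y with y^32 = 0.5. TOY_Y, struct
toy_range, toy_util_range() and toy_wall_run_ms() are hypothetical names
used only for illustration.

```c
/* Per-step decay factor, chosen so that TOY_Y^32 is about 0.5. */
#define TOY_Y 0.97857206

struct toy_range {
	int min;  /* utilization at the end of the idle phase */
	int max;  /* utilization at the end of the running phase */
};

/*
 * Steady-state utilization range of a periodic task, with the running and
 * idle phases given in PELT-clock ms (i.e. after scaling and stretching).
 */
static struct toy_range toy_util_range(int run_ms, int idle_ms)
{
	double util = 0.0, lo = 0.0, hi = 0.0;

	/* 64 periods is far more than needed for the signal to settle. */
	for (int period = 0; period < 64; period++) {
		for (int i = 0; i < run_ms; i++)
			util = util * TOY_Y + 1024.0 * (1.0 - TOY_Y);
		hi = util;
		for (int i = 0; i < idle_ms; i++)
			util *= TOY_Y;
		lo = util;
	}

	struct toy_range r = { (int)(lo + 0.5), (int)(hi + 0.5) };
	return r;
}

/* Wall-clock run time needed for work_ms of max-capacity work at cap. */
static int toy_wall_run_ms(int work_ms, int cap)
{
	return work_ms * 1024 / cap;
}
```

Under these assumptions, toy_util_range(120, 210) gives the [10-949] range
quoted above. At half capacity, toy_wall_run_ms(120, 512) is 240; scaled
back by 512/1024 that is 120 ms of PELT time, and the idle time stretches to
210 ms, so the range is unchanged. At one third of the capacity (cap 341),
toy_wall_run_ms(120, 341) is 360, which exceeds the 330 ms period, so there
is no idle phase and toy_util_range(330, 0) settles at 1024.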

> >
> > > +
> > > +             /* The rq is idle, we can sync to clock_task */
> > > +             rq->clock_pelt  = rq_clock_task(rq);
> > > +
> > > +
> > > +     } else {
> > > +             /*
> > > +              * When a rq runs at a lower compute capacity, it will need
> > > +              * more time to do the same amount of work than at max
> > > +              * capacity: either because it takes more time to compute the
> > > +              * same amount of work or because taking more time means
> > > +              * sharing more often the CPU between entities.
> > > +              * In order to be invariant, we scale the delta to reflect how
> > > +              * much work has been really done.
> > > +              * Running at lower capacity also means running longer to do
> > > +              * the same amount of work and this results in stealing some
> > > +              * idle time that will disturb the load signal compared to
> > > +              * max capacity; This stolen idle time will be automatically
> > > +              * reflected when the rq will be idle and the clock will be
> > > +              * synced with rq_clock_task.
> > > +              */
> > > +
> > > +             /*
> > > +              * scale the elapsed time to reflect the real amount of
> > > +              * computation
> > > +              */
> > > +             delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));
> > > +             delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu_of(rq)));
> > > +
> > > +             rq->clock_pelt += delta;
> >
> > AFAICT, the rq->clock_pelt is used for both utilization and load. So the load
> > also becomes a function of CPU uarch now. Is this intentional?
> 
> Yes, it is. Load is not scaled with uarch in the current implementation
> because the load would be capped by the max capacity of the local CPU,
> and this messes up the load balance.
> 
> Let's take the example of CPU0 with max capacity of 1024 and CPU1 with
> max capacity of 512.
> We have 6 always running tasks  with same nice priority
> Then, put 3 tasks on each CPU.
> If the load is scaled/capped with uarch, LB will consider the system
> balanced : 3*max_load / 1024 for CPU0 and 3*(max_load / 2) / 512 for
> CPU1. But tasks on CPU0 have twice the compute capacity of tasks on
> CPU1.
> 
> With the new scaling, we don't have this problem anymore so we can
> take into account uarch and have more accurate load.
> 
Got it.
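
The arithmetic of the CPU0/CPU1 example can be written out as a short
sketch. It only reproduces the ratios from the example and is not how the
load balancer computes anything: TOY_MAX_LOAD (the load of one
always-running task, taken as 1024 here) and both helpers are hypothetical.

```c
#include <stdint.h>

/* Assumption: load contributed by one always-running task. */
#define TOY_MAX_LOAD 1024u

/* Load of nr_tasks always-running tasks, optionally capped by capacity. */
static uint32_t toy_cpu_load(uint32_t nr_tasks, uint32_t cpu_cap, int capped)
{
	uint32_t per_task = TOY_MAX_LOAD;

	if (capped)  /* load scaled/capped with uarch, as in the example */
		per_task = (TOY_MAX_LOAD * cpu_cap) >> 10;
	return nr_tasks * per_task;
}

/* Load relative to capacity, scaled by 1024 to stay in integers. */
static uint32_t toy_load_per_capacity(uint32_t load, uint32_t cpu_cap)
{
	return (load << 10) / cpu_cap;
}
```

With the capped load, both CPUs come out at 3072 per unit of capacity, so
the system looks balanced. Without the cap, CPU1 comes out at 6144 against
3072 for CPU0, which reflects that its tasks get half the compute capacity.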

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.