linux-kernel - Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT

Message-ID: <CAKfTPtCypUSMu8JG_tdqwK4EnAjCbunDXzJvsPwBjqm+D5iG9g@mail.gmail.com>
Date:   Thu, 25 Oct 2018 12:43:23 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Dietmar Eggemann <dietmar.eggemann@....com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Patrick Bellasi <patrick.bellasi@....com>,
        Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
        Thara Gopinath <thara.gopinath@...aro.org>
Subject: Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT

On Thu, 25 Oct 2018 at 12:36, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>
> Hi Vincent,
>
> On 10/19/18 6:17 PM, Vincent Guittot wrote:
> > The current implementation of load tracking invariance scales the
> > contribution with current frequency and uarch performance (only for
> > utilization) of the CPU. One main result of this formula is that the
> > figures are capped by current capacity of CPU. Another one is that the
> > load_avg is not invariant because not scaled with uarch.
> >
> > The util_avg of a periodic task that runs r time slots every p time slots
> > varies in the range :
> >
> >      U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
> >
> > where U is the max util_avg value = SCHED_CAPACITY_SCALE
> >
> > At a lower capacity, the range becomes:
> >
> >      U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)
> >
> > with C reflecting the compute capacity ratio between current capacity and
> > max capacity.
> >
> > so C tries to compensate changes in (1-y^r') but it can't be accurate.
> >
> > Instead of scaling the contribution value of PELT algo, we should scale the
> > running time. The PELT signal aims to track the amount of computation of
> > tasks and/or rq so it seems more correct to scale the running time to
> > reflect the effective amount of computation done since the last update.
> >
> > In order to be fully invariant, we need to apply the same amount of
> > running time and idle time whatever the current capacity. Because running
> > at lower capacity implies that the task will run longer, we have to ensure
> > that the same amount of idle time will be applied when the system becomes idle
> > and no idle time has been "stolen". But reaching the maximum utilization
> > value (SCHED_CAPACITY_SCALE) means that the task is seen as an
> > always-running task whatever the capacity of the CPU (even at max compute
> > capacity). In this case, we can discard these "stolen" idle times, which
> > become meaningless.
> >
> > In order to achieve this time scaling, a new clock_pelt is created per rq.
> > The increase of this clock scales with current capacity when something
> > is running on rq and synchronizes with clock_task when rq is idle. With
> > this mechanism, we ensure the same running and idle time whatever the
> > current capacity. This also makes it possible to simplify the pelt algorithm by
> > removing all references to uarch and frequency and applying the same
> > contribution to utilization and loads. Furthermore, the scaling is done
> > only once per update of clock (update_rq_clock_task()) instead of during
> > each update of sched_entities and cfs/rt/dl_rq of the rq like the current
> > implementation. This is interesting when cgroups are involved as shown in
> > the results below:
>
> I have a couple of questions related to the tests you ran.
>
> > On a hikey (octo ARM platform).
> > Performance cpufreq governor and only shallowest c-state to remove variance
> > generated by those power features so we only track the impact of pelt algo.
>
> So you disabled c-state 'cpu-sleep' and 'cluster-sleep'?

yes

>
> I get 'hisi_thermal f7030700.tsensor: THERMAL ALARM: 66385 > 65000' on
> my hikey620. Did you change the thermal configuration? Not sure if there
> are any actions attached to this warning though.

I have a fan to ensure that no thermal mitigation will bias the measurement.

>
> > each test runs 16 times
> >
> > ./perf bench sched pipe
> > (higher is better)
> > kernel        tip/sched/core     + patch
> >          ops/seconds        ops/seconds         diff
> > cgroup
> > root    59648(+/- 0.13%)   59785(+/- 0.24%)    +0.23%
> > level1  55570(+/- 0.21%)   56003(+/- 0.24%)    +0.78%
> > level2  52100(+/- 0.20%)   52788(+/- 0.22%)    +1.32%
> >
> > hackbench -l 1000
>
> Shouldn't this be '-l 100'?

I have rechecked and it is -l 1000

>
> > (lower is better)
> > kernel        tip/sched/core     + patch
> >          duration(sec)      duration(sec)        diff
> > cgroup
> > root    4.472(+/- 1.86%)   4.346(+/- 2.74%)     -2.80%
> > level1  5.039(+/- 11.05%)  4.662(+/- 7.57%)     -7.47%
> > level2  5.195(+/- 10.66%)  4.877(+/- 8.90%)     -6.12%
> >
> > The responsiveness of PELT is improved when CPU is not running at max
> > capacity with this new algorithm. I have put below some examples of
> > duration to reach some typical load values according to the capacity of the
> > CPU with current implementation and with this patch.
> >
> > Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
> > 972 (95%)    138ms         not reachable            276ms
> > 486 (47.5%)  30ms          138ms                     60ms
> > 256 (25%)    13ms           32ms                     26ms
>
> Could you describe these testcases in more detail?

You don't need to run a test case. These numbers are computed from the
geometric series and the half-period value.
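
As a rough illustration of that computation, here is a minimal sketch. It
is not the kernel code. It assumes an always-running task starting from a
null utilization, treats one PELT period as 1ms, models the signal as
continuous with y^32 = 0.5, and uses C = 0.5 for half capacity. The names
ms_to_reach() and table_row() are made up for this example. The current
implementation is modeled as a lower ceiling (U * C) and the patch as a
slower clock (time * C).

```python
import math

PELT_HALF_LIFE_MS = 32.0        # assumption: y^32 = 0.5 with ~1ms periods
SCHED_CAPACITY_SCALE = 1024.0   # U, the max util_avg value


def ms_to_reach(util, ceiling=SCHED_CAPACITY_SCALE, time_scale=1.0):
    """Wall-clock ms for an always-running task to ramp from 0 to util.

    ceiling:    value the signal converges to (U, or U * C when the
                contribution is scaled)
    time_scale: rate of the PELT clock relative to wall clock (C when the
                running time is scaled)
    Returns None when util can never be reached.
    """
    if util >= ceiling:
        return None
    # util(t) = ceiling * (1 - y^t)  =>  t = -32 * log2(1 - util / ceiling)
    pelt_ms = -PELT_HALF_LIFE_MS * math.log2(1.0 - util / ceiling)
    return pelt_ms / time_scale


def table_row(util, c=0.5):
    """(max capacity, half capacity mainline, half capacity with patch)."""
    return (
        ms_to_reach(util),
        ms_to_reach(util, ceiling=c * SCHED_CAPACITY_SCALE),
        ms_to_reach(util, time_scale=c),
    )


if __name__ == "__main__":
    for util in (972, 486, 256):
        print(util, table_row(util))
```

Under these assumptions the first two columns round to the values in the
table, and the last column is simply twice the first one, so it lands
within about 1ms of the table depending on where the rounding is done.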

>
> So I assume you run one 100% task (possibly pinned to one CPU) on your
> hikey620 with userspace governor and for:
>
>   (1) max capacity:
>
>   echo 1200000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_setspeed
>
>   (2) half capacity:
>
>   echo 729000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_setspeed
>
> and then you measure the time till t1 reaches 25%, 47.5% and 95%
> utilization?
> What's the initial utilization value of t1? I assume t1 starts with
> utilization=512 (post_init_entity_util_avg()).
>
> > On my hikey (octo ARM platform) with schedutil governor, the time to reach
> > max OPP when starting from a null utilization, decreases from 223ms with
> > current scale invariance down to 121ms with the new algorithm. For this
> > test, I have enabled arch_scale_freq for arm64.
>
> Isn't the arch-specific arch_scale_freq_capacity() enabled by default on
> arm64 with cpufreq support?

Yes. That's a leftover from a previous version, when arch_scale_freq was not yet merged

>
> I would like to run the same tests so we can discuss results more easily.

Let me know if you need more details