Date:   Mon, 4 Jun 2018 08:41:51 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Joel Fernandes <joelaf@...gle.com>
Cc:     Juri Lelli <juri.lelli@...hat.com>,
        Patrick Bellasi <patrick.bellasi@....com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        viresh kumar <viresh.kumar@...aro.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Quentin Perret <quentin.perret@....com>,
        Luca Abeni <luca.abeni@...tannapisa.it>,
        Claudio Scordino <claudio@...dence.eu.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        "Cc: Android Kernel" <kernel-team@...roid.com>,
        Alessio Balsini <alessio.balsini@...tannapisa.it>
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 1 June 2018 at 19:45, Joel Fernandes <joelaf@...gle.com> wrote:
> On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
>> > >> >> The example with an RT task described in the cover letter can be
>> > >> >> run with a DL task and will give similar results.
>> > >
>> > > In the cover letter you say:
>> > >
>> > >    An rt-app use case which creates an always-running cfs thread and an
>> > >    rt thread that wakes up periodically, with both threads pinned on the
>> > >    same CPU, shows a lot of frequency switches of the CPU whereas the CPU
>> > >    never goes idle during the test.
>> > >
>> > > I would say that's a quite specific corner case where your always
>> > > running CFS task has never accumulated a util_est sample.
>> > >
>> > > Do we really have these cases in real systems?
>> >
>> > My example is deliberately an extreme one because it makes it easier
>> > to highlight the problem.
>> >
>> > >
>> > > Otherwise, it seems to me that we are trying to solve quite specific
>> > > corner cases by adding a non-negligible level of "complexity".
>> >
>> > By complexity, do you mean:
>> >
>> > Taking into account the number of running cfs tasks to choose between
>> > rq->dl.running_bw and avg_dl.util_avg
>> >
>> > I'm preparing a patchset that will provide the cfs waiting time in
>> > addition to dl/rt util_avg for almost no additional cost. I will try
>> > to send the proposal later today.
>>
>>
>> The code below adds tracking of the waiting level of cfs tasks due to
>> rt/dl preemption. This waiting time can then be used when selecting an OPP,
>> instead of the dl util_avg, which could become higher than the dl bandwidth
>> with a "long" runtime.
>>
>> We need only one new call, for the 1st cfs task that is enqueued, to get
>> these additional metrics. The call to arch_scale_cpu_capacity() can be
>> removed once CPU capacity is taken into account when computing the load
>> (which currently scales only with frequency).
>>
>> For rt tasks, we must keep taking util_avg into account to have an idea
>> of the rt level on the cpu; for dl, this is given by the bandwidth.
>>
>> ---
>>  kernel/sched/fair.c  | 27 +++++++++++++++++++++++++++
>>  kernel/sched/pelt.c  |  8 ++++++--
>>  kernel/sched/sched.h |  4 +++-
>>  3 files changed, 36 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index eac1f9a..1682ea7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
>>  }
>>  #endif
>>
>> +static inline void update_cfs_wait_util_avg(struct rq *rq)
>> +{
>> +     /*
>> +      * If cfs is already enqueued, we don't have anything to do because
>> +      * we have already updated the non-waiting time
>> +      */
>> +     if (rq->cfs.h_nr_running)
>> +             return;
>> +
>> +     /*
>> +      * If rt is running, we update the non-waiting time before
>> +      * increasing cfs.h_nr_running
>> +      */
>> +     if (rq->curr->sched_class == &rt_sched_class)
>> +             update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>> +
>> +     /*
>> +      * If dl is running, we update the non-waiting time before
>> +      * increasing cfs.h_nr_running
>> +      */
>> +     if (rq->curr->sched_class == &dl_sched_class)
>> +             update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
>> +}
>> +
>
> Please correct me if I'm wrong, but the CFS preemption-decay happens in
> set_next_entity -> update_load_avg when the CFS task is scheduled again after
> the preemption. Then can we not fix this issue by doing our UTIL_EST magic
> from set_next_entity? But yeah, we probably need to be careful with overhead...

util_est is there to keep track of the last max. I'm not sure that
trying to add some magic to take preemption into account is the right
way to go. Mixing several pieces of information into the same metric
just adds more fuzziness to the meaning of the metric.

>
> IMO I feel it's overkill to account dl_avg when we already have DL's running
> bandwidth we can use. I understand it may be too instantaneous, but perhaps we

We keep using the dl bandwidth, which is quite correct for dl's needs but
doesn't reflect how dl has disturbed the other classes.
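
As a rough illustration of that distinction, the schedutil-side aggregation
could look like the sketch below. This hunk is not part of this mail, and the
function shape plus field names (avg_rt, a cfs waiting signal avg_wait) are
assumptions; only the dl.running_bw conversion and the capacity clamp follow
existing kernel conventions:

static unsigned long sugov_aggregate_util(struct rq *rq)
{
	unsigned long util = rq->cfs.avg.util_avg;

	/* dl's own needs: its reserved bandwidth, not its util_avg */
	util += (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/* rt level on the cpu: tracked with util_avg (assumed field) */
	util += rq->avg_rt.util_avg;

	/* time cfs tasks spent waiting behind rt/dl (hypothetical field) */
	util += rq->cfs.avg_wait.util_avg;

	return min_t(unsigned long, util,
		     arch_scale_cpu_capacity(NULL, cpu_of(rq)));
}

Here dl contributes only what it has reserved, while the disturbance it
causes shows up separately in the new cfs waiting metric.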

> can fix CFS's problems within CFS itself and not have to do this kind of
> extra external accounting?
>
> I also feel it's better if we don't have to call update_{rt,dl}_rq_load_avg
> from within the CFS class as is being done above.
>
> thanks,
>
>  - Joel
>
