linux-kernel - Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20180601174526.GA105687@joelaf.mtv.corp.google.com>
Date:   Fri, 1 Jun 2018 10:45:26 -0700
From:   Joel Fernandes <joelaf@...gle.com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Juri Lelli <juri.lelli@...hat.com>,
        Patrick Bellasi <patrick.bellasi@....com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        viresh kumar <viresh.kumar@...aro.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Quentin Perret <quentin.perret@....com>,
        Luca Abeni <luca.abeni@...tannapisa.it>,
        Claudio Scordino <claudio@...dence.eu.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        kernel-team@...roid.com,
        Alessio Balsini <alessio.balsini@...tannapisa.it>
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
> > >> >> The example with a RT task described in the cover letter can be
> > >> >> run with a DL task and will give similar results.
> > >
> > > In the cover letter you says:
> > >
> > >    A rt-app use case which creates an always running cfs thread and a
> > >    rt threads that wakes up periodically with both threads pinned on
> > >    same CPU, show lot of frequency switches of the CPU whereas the CPU
> > >    never goes idles during the test.
> > >
> > > I would say that's a quite specific corner case where your always
> > > running CFS task has never accumulated a util_est sample.
> > >
> > > Do we really have these cases in real systems?
> > 
> > My example is voluntary an extreme one because it's easier to
> > highlight the problem
> > 
> > >
> > > Otherwise, it seems to me that we are trying to solve quite specific
> > > corner cases by adding a not negligible level of "complexity".
> > 
> > By complexity, do you mean :
> > 
> > Taking into account the number cfs running task to choose between
> > rq->dl.running_bw and avg_dl.util_avg
> > 
> > I'm preparing a patchset that will provide the cfs waiting time in
> > addition to dl/rt util_avg for almost no additional cost. I will try
> > to sent the proposal later today
> 
> 
> The code below adds the tracking of the waiting level of cfs tasks because of
> rt/dl preemption. This waiting time can then be used when selecting an OPP
> instead of the dl util_avg which could become higher than dl bandwidth with
> "long" runtime
> 
> We need only one new call for the 1st cfs task that is enqueued to get these additional metrics
> the call to arch_scale_cpu_capacity() can be removed once the later will be
> taken into account when computing the load (which scales only with freq
> currently)
> 
> For rt task, we must keep to take into account util_avg to have an idea of the
> rt level on the cpu which is given by the badnwodth for dl
> 
> ---
>  kernel/sched/fair.c  | 27 +++++++++++++++++++++++++++
>  kernel/sched/pelt.c  |  8 ++++++--
>  kernel/sched/sched.h |  4 +++-
>  3 files changed, 36 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eac1f9a..1682ea7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
>  
> +static inline void update_cfs_wait_util_avg(struct rq *rq)
> +{
> +	/*
> +	 * If cfs is already enqueued, we don't have anything to do because
> +	 * we already updated the non waiting time
> +	 */
> +	if (rq->cfs.h_nr_running)
> +		return;
> +
> +	/*
> +	 * If rt is running, we update the non wait time before increasing
> +	 * cfs.h_nr_running)
> +	 */
> +	if (rq->curr->sched_class == &rt_sched_class)
> +		update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
> +
> +	/*
> +	 * If dl is running, we update the non time before increasing
> +	 * cfs.h_nr_running)
> +	 */
> +	if (rq->curr->sched_class == &dl_sched_class)
> +		update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
> +}
> +

Please correct me if I'm wrong but the CFS preemption-decay happens in
set_next_entity -> update_load_avg when the CFS task is scheduled again after
the preemption. Then can we not fix this issue by doing our UTIL_EST magic
from set_next_entity? But yeah probably we need to be careful with overhead..

IMO I feel its overkill to account dl_avg when we already have DL's running
bandwidth we can use. I understand it may be too instanenous, but perhaps we
can fix CFS's problems within CFS itself and not have to do this kind of
extra external accounting ?

I also feel its better if we don't have to call update_{rt,dl}_rq_load_avg
from within CFS class as being done above.

thanks,

 - Joel