Date:   Tue, 5 Jun 2018 12:33:17 -0700
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Patrick Bellasi <patrick.bellasi@....com>
Cc:     linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Steve Muckle <smuckle@...gle.com>, Todd Kjos <tkjos@...gle.com>
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

On Tue, Jun 05, 2018 at 04:21:56PM +0100, Patrick Bellasi wrote:
[..]
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index f74441be3f44..5d54d6a4c31f 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > >  		sa->runnable_load_sum =
> > >  			decay_load(sa->runnable_load_sum, periods);
> > >  		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > > +		if (running)
> > > +			sa->running_sum = decay_load(sa->running_sum, periods);
> > >  
> > >  		/*
> > >  		 * Step 2
> > > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > >  		sa->load_sum += load * contrib;
> > >  	if (runnable)
> > >  		sa->runnable_load_sum += runnable * contrib;
> > > -	if (running)
> > > +	if (running) {
> > >  		sa->util_sum += contrib * scale_cpu;
> > > +		sa->running_sum += contrib * scale_cpu;
> > > +	}
> > >  
> > >  	return periods;
> > >  }
> > > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > >  	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > >  }
> > 
> > PELT changes look nice and make sense :)
> 
> That's not strictly speaking a PELT change... the idea is still to work
> "on top of PELT" to make it more effective at measuring a task's
> expected CPU bandwidth requirement.

I meant "PELT change" as in change to the code that calculates PELT signals..

> > > +static inline void util_est_enqueue_running(struct task_struct *p)
> > > +{
> > > +	/* Initialize the (non-preempted) utilization */
> > > +	p->se.avg.running_sum = p->se.avg.util_sum;
> > > +}
> > > +
> > >  /*
> > >   * Check if a (signed) value is within a specified (unsigned) margin,
> > >   * based on the observation that:
> > > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > >  	 * Skip update of task's estimated utilization when its EWMA is
> > >  	 * already ~1% close to its last activation value.
> > >  	 */
> > > -	ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > > +	ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> > 
> > I guess we are doing an extra division here, which adds some cost. Does
> > performance look OK with the change?
> 
> This extra division is done only at dequeue time, instead of at each
> update_load_avg.

I know. :)

> To be more precise, at each ___update_load_avg we should really update
> running_avg by:
> 
>    u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
>    sa->running_avg = sa->running_sum / divider;
> 
> but, this would imply tracking an additional signal in sched_avg and
> doing an additional division at ___update_load_avg() time.
> 
> Morten suggested that, if we accept the rounding errors from
> approximating
> 
>       divider ~= LOAD_AVG_MAX
> 
> i.e. discarding the (sa->period_contrib - 1024) correction, then we
> can completely skip tracking running_avg (thus saving space in
> sched_avg) and approximate it at dequeue time, as per the code line
> above, just to compute the new util_est sample to accumulate.
> 
> Does that make sense now?

The patch always made sense to me. I was just pointing out the extra
division this patch adds. I agree that since it's done only at dequeue, it's
probably OK.
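
FWIW, just to convince myself about the size of those rounding errors: since
sa->period_contrib is always less than 1024, the real divider lies in
[LOAD_AVG_MAX - 1024, LOAD_AVG_MAX), so dividing by LOAD_AVG_MAX can only
under-estimate. A tiny standalone sketch (mine, not part of the patch;
LOAD_AVG_MAX hard-coded to the 47742 from sched-pelt.h) of the worst case:

	#include <stdio.h>

	/* Value generated in kernel/sched/sched-pelt.h */
	#define LOAD_AVG_MAX	47742

	int main(void)
	{
		/*
		 * Worst case is period_contrib == 0: the exact divider is
		 * LOAD_AVG_MAX - 1024, while the approximation divides by
		 * LOAD_AVG_MAX instead.
		 */
		double worst = 1024.0 / LOAD_AVG_MAX;

		printf("worst-case relative under-estimate: %.2f%%\n",
		       worst * 100.0);
		return 0;
	}

That prints about 2.1%, so if I got the math right the dequeue-time
approximation can only under-estimate the non-preempted utilization, and by
at most ~2%, which sounds acceptable for util_est's purposes.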

thanks,

 - Joel
