Date:   Wed, 3 May 2017 19:11:15 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Morten Rasmussen <morten.rasmussen@....com>
Cc:     peterz@...radead.org, mingo@...nel.org,
        linux-kernel@...r.kernel.org, dietmar.eggemann@....com,
        yuyang.du@...el.com, pjt@...gle.com, bsegall@...gle.com
Subject: Re: [PATCH v2] sched/fair: update scale invariance of PELT

On Friday 28 Apr 2017 at 16:52:59 (+0100), Morten Rasmussen wrote:
> Hi Vincent,
> 
> Sorry for crashing the party this late. As you know, it takes a long
> period of uninterrupted review time to properly review PELT stuff.
> 
> Disclaimer: I haven't read the rest of the thread yet.
> 
> On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:
> > The current implementation of load tracking invariance scales the
> > contribution with current frequency and uarch performance (only for
> > utilization) of the CPU. One main result of this formula is that the
> > figures are capped by the current capacity of the CPU. Another one is that
> > load_avg is not invariant because it is not scaled with uarch.
> > 
> > The util_avg of a periodic task that runs r time slots every p time slots
> > varies in the range :
> > 
> >     U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
> > 
> > where U is the max util_avg value = SCHED_CAPACITY_SCALE
> > 
> > At a lower capacity, the range becomes:
> > 
> >     U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)
> > 
> > with C reflecting the compute capacity ratio between current capacity and
> > max capacity.
> > 
> > So C tries to compensate for changes in (1-y^r'), but it can't be accurate.
> 
> I would say that C is trying to compensate for difference in running
> time, which is difference between r and r'. But I think that is what you
> are saying as well.
> 
> > Instead of scaling the contribution value of PELT algo, we should scale the
> > running time. The PELT signal aims to track the amount of computation of
> > tasks and/or rq so it seems more correct to scale the running time to
> > reflect the effective amount of computation done since the last update.
> > 
> > In order to be fully invariant, we need to apply the same amount of
> > running time and idle time whatever the current capacity. Because running
> > at lower capacity implies that the task will run longer, we have to track
> > the amount of "stolen" idle time and apply it when the task becomes idle.
> 
> Overall, I agree that scaling time according to compute capacity is the
> right thing to do, if we want accurate scale-invariance. The current
> implementation is an approximation which isn't perfect, but it is easy
> to implement.
> 
> One thing that became clear to me after reading the patch and trying to
> recreate what it does is that you don't seem to consider waiting time,
> and that the patch doesn't scale it. That is fine for utilization, but it
> breaks load. For load, you have to scale waiting time as well, which
> makes things rather complicated as you would need two scaled time
> deltas, one for utilization and one for load.
>

In fact, I have always considered that we don't need to scale !running time
(runnable and sleep time) because waiting takes the same time whatever the
frequency or the micro-architecture.
Nevertheless, I agree that running longer at lower capacity increases the
number of times a task can be preempted by another, thus increasing its
runnable time.

The solution would be to scale both running and runnable time and to add the
stolen time once idle. This also seems more accurate for utilization when the
running task is preempted by another task. In fact, it's something that Peter
raised in earlier comments, but I didn't see the benefit of scaling runnable
time back then... I was probably too focused on utilization.

This behavior is not really noticeable with the short running times of
Dietmar's example, but it becomes relevant when tasks need more time and are
preempted more often by others.

With the addition below on top of the patch, the load_avg issue raised by
Dietmar's example is fixed and the utilization remains correct.

---
 kernel/sched/fair.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b036f1..3e647b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2854,12 +2854,15 @@ static __always_inline u64
 scale_time(u64 delta, int cpu, struct sched_avg *sa,
 		unsigned long weight, int running)
 {
-	if (running) {
+	if (weight) {
 		/*
-		 * When an entity runs at a lower compute capacity, it will
-		 * need more time to do the same amount of work than at max
-		 * capacity. In order to be invariant, we scale the delta to
-		 * reflect how much work has been really done.
+		 * When an entity is runnable at a lower compute capacity,
+		 * it will need more time to do the same amount of work than
+		 * at max capacity: either because it takes more time to
+		 * compute the same amount of work, or because taking more
+		 * time means sharing the CPU more often with others.
+		 * In order to be invariant, we scale the delta to reflect how
+		 * much work has been really done.
 		 * Running at lower capacity also means running longer to do
 		 * the same amount of work and this results in stealing some
 		 * idle time that will disturb the load signal compared to
@@ -2883,7 +2886,7 @@ scale_time(u64 delta, int cpu, struct sched_avg *sa,
 		 * lower capacity
 		 */
 		sa->stolen_idle_time -= delta;
-	} else if (!weight) {
+	} else {
 		/*
 		 * Entity is sleeping so both utilization and load will decay
 		 * and we can safely add the stolen time. Reflecting some
-- 

>
> What currently happens is that waiting tasks at low OPPs are
> accumulating load at the rate of the highest OPP without accumulating
> "stolen" time. So waiting tasks accumulate too much load at lower OPPs.
> We have confirmed it by scheduling two periodic tasks on the same cpu
> and verified that the load does ramp up faster than it should.
> 
> For the aggregate cfs_rq util/load sums, I can't convince myself whether
> they break when "stolen" time isn't inserted at the same time in the
> task signal as it is for the cfs_rq signal. If you have two periodic
> tasks scheduled on the same cpu with aligned periods, then the first
> task will inject its stolen time as soon as it finishes, while the

In fact, I think that the first task will wait for its next update before
adding the lost idle time, but that doesn't change the point.

> aggregate sum won't inject the stolen time until the second task has
> completed, and then injects the stolen time for both tasks at once. As
> long as no tasks are migrated to/from the cpu it should work, but I'm not
> convinced that it works if tasks are migrated or new tasks show up
> before the stolen time has been committed to the aggregate sum. Have you
> looked more into this?
>
>
The task and cfs_rq are synced before any change of task state
(attach/detach or enqueue/dequeue), so I don't expect them to get out of sync.

> 
> Another question is what happens to blocked utilization? IIUC, the
> stolen time is injected when the task wakes up again so the utilization
> drops to where it should be. Does that mean that the blocked
> contribution of the task is inflated the entire time while it is
> blocked?
>

Only at the task level, but its contribution to the cfs_rq will be updated.
This is exactly what already happens: we update task metrics only when the
task wakes up.

> 
> > But once we have reached the maximum utilization value (SCHED_CAPACITY_SCALE),
> > it means that the task is seen as an always-running task whatever the
> > capacity of the cpu (even at max compute capacity). In this case, we can
> > discard the "stolen" idle time, which becomes meaningless. In order to
> > cope with the rounding effects of the PELT algo, we take a margin and
> > consider a task with utilization greater than 1000 (vs 1024 max) as an
> > always-running task.
> 
> I understand why the threshold is necessary, but I'm slightly concerned
> that it might introduce corner cases where things could go very wrong. I
> haven't proven that it is indeed a problem, but is it possible to have
> aggregate sum of utilization >1000 without the cpu being fully utilized?
> So the tasks will have their stolen time injected, but the cfs_rq
> utilization won't?
> 
> > 
> > Then, we can use the same algorithm for both utilization and load and
> > simplify __update_load_avg now that the load of a task doesn't have to be
> > capped by CPU uarch.
> 
> As said above, waiting time has to be handled differently for
> utilization and load.
> 
> > The responsiveness of PELT is improved when the CPU is not running at max
> > capacity with this new algorithm. I have put below some examples of
> > duration to reach some typical load values according to the capacity of the
> > CPU with current implementation and with this patch.
> > 
> > Util (%)     max capacity  half capacity (mainline)  half capacity (w/ patch)
> > 972 (95%)    138ms         not reachable             276ms
> > 486 (47.5%)  30ms          138ms                     60ms
> > 256 (25%)    13ms          32ms                      26ms
> > 
> > On my hikey (octo ARM platform) with the schedutil governor, the time to
> > reach max OPP when starting from null utilization decreases from 223ms
> > with the current scale invariance down to 121ms with the new algorithm.
> > For this test, I have enabled arch_scale_freq for arm64.
> 
> Note that the time to reach a specific utilization with your
> implementation depends on the OPP. The lower the OPP, the slower
> utilization accumulates.

Yes, like the current implementation. Everything is based on: a lower OPP
means proportionally longer running time.

Thanks

> 
> [...]
> 
> > @@ -2852,19 +2850,54 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> >  	}
> >  	sa->period_contrib = delta;
> >  
> > -	contrib = cap_scale(contrib, scale_freq);
> >  	if (weight) {
> >  		sa->load_sum += weight * contrib;
> >  		if (cfs_rq)
> >  			cfs_rq->runnable_load_sum += weight * contrib;
> >  	}
> >  	if (running)
> > -		sa->util_sum += contrib * scale_cpu;
> > +		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
> >  
> >  	return periods;
> >  }
> >  
> >  /*
> > + * Scale the time to reflect the effective amount of computation done during
> > + * this delta time.
> > + */
> > +static __always_inline u64
> > +scale_time(u64 delta, int cpu, struct sched_avg *sa,
> > +		unsigned long weight, int running)
> > +{
> > +	if (running) {
> > +		sa->stolen_idle_time += delta;
> > +		/*
> > +		 * scale the elapsed time to reflect the real amount of
> > +		 * computation
> > +		 */
> > +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> > +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> > +
> > +		/*
> > +		 * Track the amount of stolen idle time due to running at
> > +		 * lower capacity
> > +		 */
> > +		sa->stolen_idle_time -= delta;
> > +	} else if (!weight) {
> > +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
> > +			/*
> > +			 * Add the idle time stolen by running at lower compute
> > +			 * capacity
> > +			 */
> > +			delta += sa->stolen_idle_time;
> > +		}
> > +		sa->stolen_idle_time = 0;
> > +	}
> > +
> > +	return delta;
> 
> As mentioned above, waiting time, i.e. !running && weight, is not
> scaled, which causes trouble for load.
> 
> Morten
