linux-kernel - [PATCH] sched/fair: update scale invariance of pelt

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1448372970-8764-1-git-send-email-vincent.guittot@linaro.org>
Date:	Tue, 24 Nov 2015 14:49:30 +0100
From:	Vincent Guittot <vincent.guittot@...aro.org>
To:	peterz@...radead.org, mingo@...nel.org,
	linux-kernel@...r.kernel.org, yuyang.du@...el.com,
	Morten.Rasmussen@....com
Cc:	linaro-kernel@...ts.linaro.org, dietmar.eggemann@....com,
	pjt@...gle.com, bsegall@...gle.com,
	Vincent Guittot <vincent.guittot@...aro.org>
Subject: [PATCH] sched/fair: update scale invariance of pelt

The current implementation of load tracking invariance scales the load
tracking value with current frequency and uarch performance (only for
utilization) of the CPU.

One main result of the current formula is that the figures are capped by
the current capacity of the CPU. This limitation is the main reason of not
including the uarch invariance (arch_scale_cpu_capacity) in the calculation
of load_avg because capping the load can generate erroneous system load
statistic as described with this example [1]

Instead of scaling the complete value of PELT algo, we should only scale
the running time by the current capacity of the CPU. It seems more correct
to only scale the running time because the non running time of a task
(sleeping or waiting for a runqueue) is the same whatever the current freq
and the compute capacity of the CPU.

Then, one main advantage of this change is that the load of a task can
reach max value whatever the current freq and the uarch of the CPU on which
it run. It will just take more time at a lower freq than a max freq or on a
"little" CPU compared to a "big" one. The load and the utilization stay
invariant across system so we can still compared them between CPU but with
a wider range of values.

With this change, we don't have to test if a CPU is overloaded or not in
order to use one metric (util) or another (load) as all metrics are always
valid.

I have put below some examples of duration to reach some typical load value
according to the capacity of the CPU with current implementation
and with this patch. 

Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
972 (95%)    138ms	   not reachable	    276ms
486 (47.5%)  30ms	   138ms		     60ms
256 (25%)    13ms	    32ms		     26ms

We can see that at half capacity, we need twice the duration of max
capacity with this patch whereas we have a non linear increase of the
duration with current implementation.

[1] https://lkml.org/lkml/2014/12/18/128

Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
---
 kernel/sched/fair.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 824aa9f..f2a18e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2560,10 +2560,9 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
+	u64 delta, periods;
 	u32 contrib;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq, scale_cpu;
+	unsigned int delta_w, decayed = 0;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	if (running) {
+		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
+		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
+	}
 
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;
@@ -2601,16 +2602,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
+			sa->load_sum += weight * delta_w;
 			if (cfs_rq) {
 				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
+						weight * delta_w;
 			}
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
+			sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
 
 		delta -= delta_w;
 
@@ -2627,25 +2627,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
-		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
 			if (cfs_rq)
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib * scale_cpu;
+			sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 	}
 
 	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
 	if (weight) {
-		sa->load_sum += weight * scaled_delta;
+		sa->load_sum += weight * delta;
 		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
+			cfs_rq->runnable_load_sum += weight * delta;
 	}
 	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
+		sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
 
 	sa->period_contrib += delta;
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/