linux-kernel - [PATCH] sched/fair: Make PELT signal more accurate


Message-Id: <20170804154023.26874-1-joelaf@google.com>
Date:   Fri,  4 Aug 2017 08:40:23 -0700
From:   Joel Fernandes <joelaf@...gle.com>
To:     linux-kernel@...r.kernel.org
Cc:     kernel-team@...roid.com, Joel Fernandes <joelaf@...gle.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@....com>,
        Brendan Jackman <brendan.jackman@....com>,
        Dietmar Eggeman <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...hat.com>
Subject: [PATCH] sched/fair: Make PELT signal more accurate

The PELT signals (sa->load_avg and sa->util_avg) are not updated if the amount
accumulated during a single update doesn't cross a period boundary. This is
fine when the amount accrued is much smaller than the size of a single PELT
window (1ms). However, if the amount accrued is large, the relative error
(calculated against what the signal would be had we updated the averages)
can be quite high: as much as 3-6% in my testing. On plotting the signals, I
found that the errors are especially high when we update just before the
period boundary is hit. These errors can be significantly reduced if we
update the averages more often.
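
As a rough illustration of why the error peaks just before a period boundary,
here is a toy floating-point model, not the kernel's fixed-point code. It
assumes an entity that has run continuously since idle, a decay factor y with
y^32 = 0.5, and an average that was last refreshed exactly at the most recent
window boundary. The names toy_sum() and toy_stale_error() are hypothetical.

```c
#include <assert.h>

#define PERIOD_US 1024.0      /* one PELT window, roughly 1ms */
#define DECAY_Y   0.97857206  /* approximately 0.5^(1/32) */

/*
 * Toy *_sum for an entity that ran continuously for `full_periods` whole
 * windows plus `partial_us` of the current window. Each completed window
 * is decayed by y once per boundary crossed since; the current window is
 * not decayed.
 */
static double toy_sum(int full_periods, double partial_us)
{
	double decayed = 0.0, w = 1.0;

	assert(full_periods >= 0);
	assert(partial_us >= 0.0 && partial_us < PERIOD_US);

	for (int i = 0; i < full_periods; i++) {
		w *= DECAY_Y;
		decayed += PERIOD_US * w;
	}
	return decayed + partial_us;
}

/*
 * Relative error of an average last refreshed at the window boundary,
 * compared with one refreshed now. A constant divider is assumed, so it
 * cancels out of the ratio.
 */
static double toy_stale_error(int full_periods, double partial_us)
{
	double fresh = toy_sum(full_periods, partial_us);
	double stale = toy_sum(full_periods, 0.0);

	return fresh > 0.0 ? (fresh - stale) / fresh : 0.0;
}
```

In this model the error is zero right at a boundary and grows with the time
spent in the current window, which is consistent with the errors being
largest just before the next boundary.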

In order to fix this, this patch also checks how much time has elapsed since
the last update and updates the averages if it has been long enough (as a
threshold I chose 128us).
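
The resulting decision can be sketched in isolation as below. This is a
minimal standalone sketch, not the kernel code: should_refresh_avg() and
PELT_MIN_UPDATE_US are hypothetical names, and the delta is assumed to already
be in roughly microsecond units, as the 128us threshold implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the 128us threshold from the description. */
#define PELT_MIN_UPDATE_US 128

/*
 * Returns true when the averages would be recomputed:
 *   periods  - number of period boundaries crossed by this update
 *   delta_us - time since the last update, assumed to be in ~1us units
 */
static bool should_refresh_avg(int periods, uint64_t delta_us)
{
	assert(periods >= 0);
	return periods > 0 || delta_us >= PELT_MIN_UPDATE_US;
}
```

An update that crosses no boundary and arrives less than 128us after the
previous one still skips the refresh, so very frequent updates do not pay the
cost of recomputing the averages every time.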

To compare the signals with and without the patch, I created a synthetic test
(20ms runtime, 100ms period), analyzed the signals, and put the analysis data
and plots for both cases in a report:
http://www.linuxinternals.org/misc/pelt-error.pdf

With the patch, the error in the signal is significantly reduced: only a
small, negligible amount remains.

Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Juri Lelli <juri.lelli@....com>
Cc: Brendan Jackman <brendan.jackman@....com>
Cc: Dietmar Eggeman <dietmar.eggemann@....com>
Signed-off-by: Joel Fernandes <joelaf@...gle.com>
---
 kernel/sched/fair.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f1825d60937..1347643737f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2882,6 +2882,7 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
 	u64 delta;
+	int periods;

 	delta = now - sa->last_update_time;
 	/*
@@ -2908,9 +2909,12 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	 * accrues by two steps:
 	 *
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
-	 * crossed period boundaries, finish.
+	 * crossed period boundaries and the time since last update is small
+	 * enough, we're done.
 	 */
-	if (!accumulate_sum(delta, cpu, sa, weight, running, cfs_rq))
+	periods = accumulate_sum(delta, cpu, sa, weight, running, cfs_rq);
+
+	if (!periods && delta < 128)
 		return 0;

 	/*
-- 
2.14.0.rc1.383.gd1ce394fe2-goog