[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20170703133420.980724143@linuxfoundation.org>
Date: Mon, 3 Jul 2017 15:35:22 +0200
From: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
To: linux-kernel@...r.kernel.org
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
stable@...r.kernel.org, Matt Fleming <matt@...eblueprint.co.uk>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Frederic Weisbecker <fweisbec@...il.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Mike Galbraith <efault@....de>,
Mike Galbraith <umgwanakikbuti@...il.com>,
Morten Rasmussen <morten.rasmussen@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Vincent Guittot <vincent.guittot@...aro.org>,
Ingo Molnar <mingo@...nel.org>
Subject: [PATCH 4.9 142/172] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
4.9-stable review patch. If anyone has any objections, please let me know.
------------------
From: Matt Fleming <matt@...eblueprint.co.uk>
commit 6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 upstream.
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.
This situation on exiting NO_HZ is described by:
this_rq->calc_load_update < jiffies < calc_load_update
In this scenario, what we should be doing is:
this_rq->calc_load_update = calc_load_update [ next window ]
But what we actually do is:
this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]
This has the effect of delaying load average updates for potentially
up to ~9seconds.
This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.
It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().
This issue is easy to reproduce before,
commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.
Signed-off-by: Matt Fleming <matt@...eblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
Cc: Frederic Weisbecker <fweisbec@...il.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Mike Galbraith <efault@....de>
Cc: Mike Galbraith <umgwanakikbuti@...il.com>
Cc: Morten Rasmussen <morten.rasmussen@....com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@...nel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
---
kernel/sched/loadavg.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -201,8 +201,9 @@ void calc_load_exit_idle(void)
struct rq *this_rq = this_rq();
/*
- * If we're still before the sample window, we're done.
+ * If we're still before the pending sample window, we're done.
*/
+ this_rq->calc_load_update = calc_load_update;
if (time_before(jiffies, this_rq->calc_load_update))
return;
@@ -211,7 +212,6 @@ void calc_load_exit_idle(void)
* accounted through the nohz accounting, so skip the entire deal and
* sync up for the next window.
*/
- this_rq->calc_load_update = calc_load_update;
if (time_before(jiffies, this_rq->calc_load_update + 10))
this_rq->calc_load_update += LOAD_FREQ;
}
Powered by blists - more mailing lists