linux-kernel - [tip:sched/core] sched/loadavg: Avoid loadavg spikes caused by delayed NO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <tip-6e5f32f7a43f45ee55c401c0b9585eb01f9629a8@git.kernel.org>
Date:   Thu, 16 Mar 2017 04:13:20 -0700
From:   tip-bot for Matt Fleming <tipbot@...or.com>
To:     linux-tip-commits@...r.kernel.org
Cc:     umgwanakikbuti@...il.com, morten.rasmussen@....com, hpa@...or.com,
        peterz@...radead.org, vincent.guittot@...aro.org,
        tglx@...utronix.de, matt@...eblueprint.co.uk,
        linux-kernel@...r.kernel.org, fweisbec@...il.com, mingo@...nel.org,
        efault@....de, torvalds@...ux-foundation.org
Subject: [tip:sched/core] sched/loadavg: Avoid loadavg spikes caused by
 delayed NO_HZ accounting

Commit-ID:  6e5f32f7a43f45ee55c401c0b9585eb01f9629a8
Gitweb:     http://git.kernel.org/tip/6e5f32f7a43f45ee55c401c0b9585eb01f9629a8
Author:     Matt Fleming <matt@...eblueprint.co.uk>
AuthorDate: Fri, 17 Feb 2017 12:07:30 +0000
Committer:  Ingo Molnar <mingo@...nel.org>
CommitDate: Thu, 16 Mar 2017 09:21:00 +0100

sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting

If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.

This situation on exiting NO_HZ is described by:

  this_rq->calc_load_update < jiffies < calc_load_update

In this scenario, what we should be doing is:

  this_rq->calc_load_update = calc_load_update		     [ next window ]

But what we actually do is:

  this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]

This has the effect of delaying load average updates for potentially
up to ~9seconds.

This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.

It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().

This issue is easy to reproduce before,

  commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.

Signed-off-by: Matt Fleming <matt@...eblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
Cc: Frederic Weisbecker <fweisbec@...il.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Mike Galbraith <efault@....de>
Cc: Mike Galbraith <umgwanakikbuti@...il.com>
Cc: Morten Rasmussen <morten.rasmussen@....com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@...nel.org>
---
 kernel/sched/loadavg.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 7296b73..3a55f3f 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -202,8 +202,9 @@ void calc_load_exit_idle(void)
 	struct rq *this_rq = this_rq();
 
 	/*
-	 * If we're still before the sample window, we're done.
+	 * If we're still before the pending sample window, we're done.
 	 */
+	this_rq->calc_load_update = calc_load_update;
 	if (time_before(jiffies, this_rq->calc_load_update))
 		return;
 
@@ -212,7 +213,6 @@ void calc_load_exit_idle(void)
 	 * accounted through the nohz accounting, so skip the entire deal and
 	 * sync up for the next window.
 	 */
-	this_rq->calc_load_update = calc_load_update;
 	if (time_before(jiffies, this_rq->calc_load_update + 10))
 		this_rq->calc_load_update += LOAD_FREQ;
 }