Message-ID: <20160704150452.GP8415@codeblueprint.co.uk>
Date:	Mon, 4 Jul 2016 16:04:52 +0100
From:	Matt Fleming <matt@...eblueprint.co.uk>
To:	Dietmar Eggemann <dietmar.eggemann@....com>
Cc:	Mike Galbraith <umgwanakikbuti@...il.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Yuyang Du <yuyang.du@...el.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [rfc patch] sched/fair: Use instantaneous load for fork/exec
 balancing

On Wed, 15 Jun, at 04:32:58PM, Dietmar Eggemann wrote:
> On 14/06/16 17:40, Mike Galbraith wrote:
> > On Tue, 2016-06-14 at 15:14 +0100, Dietmar Eggemann wrote:
> > 
> >> IMHO, the hackbench performance "boost" w/o 0905f04eb21f is due to the
> >> fact that a new task gets all its load decayed (making it a small task)
> >> in the __update_load_avg() call in remove_entity_load_avg(), because its
> >> se->avg.last_update_time value is 0, which creates a huge time difference
> >> compared to cfs_rq->avg.last_update_time. The patch 0905f04eb21f
> >> avoids this and thus the task stays big (se->avg.load_avg = 1024).
> > 
> > I don't care much at all about the hackbench "regression" in its own
> > right, or what causes it; for me, the bottom line is that there are
> > cases we need to be able to resolve, and can't, simply because we're
> > looking at a fuzzy (rippling) reflection.
> 
> Understood. I just thought it would be nice to know why 0905f04eb21f
> makes this problem even more visible. But so far I wasn't able to figure
> out why this diff in se->avg.load_avg [1024 versus 0] has this effect on
> cfs_rq->runnable_load_avg, making it even less suitable in find_idlest_*.
> enqueue_entity_load_avg()'s cfs_rq->runnable_load_* += sa->load_* looks
> suspicious, though.

In my testing without 0905f04eb21f I saw that se->avg.load_avg
managed to skip being decayed entirely before the task was dequeued,
which meant that cfs_rq->runnable_load_avg was more likely to be zero
after dequeue for workloads like hackbench that are essentially a
fork bomb.

se->avg.load_avg evaded decay because se->avg.period_contrib was being
zeroed in __update_load_avg().
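
For anyone following along, here's a toy user-space model of the
accounting step I'm describing. It is not the kernel's
__update_load_avg() (the struct and function names below are made up
and the running-sum bookkeeping is omitted), but it shows the bit
that matters: decay is only applied for whole 1024us periods, with
any sub-period remainder carried in period_contrib, so an entity
that never crosses a period boundary between updates keeps its
load_avg untouched.

#include <math.h>
#include <stdio.h>

/* Hypothetical stand-in for the relevant fields of struct sched_avg. */
struct sched_avg_model {
	unsigned long load_avg;
	unsigned int period_contrib;	/* us already spent in the current period */
};

/* y chosen so that y^32 == 0.5, i.e. the average halves every 32ms. */
static const double y = 0.978572;

static void model_update(struct sched_avg_model *sa, unsigned int delta_us)
{
	unsigned int periods;

	delta_us += sa->period_contrib;
	periods = delta_us / 1024;		/* completed 1024us periods */
	sa->period_contrib = delta_us % 1024;

	/* Only whole periods decay the average; sub-period time never does. */
	if (periods)
		sa->load_avg = (unsigned long)(sa->load_avg * pow(y, periods));
}

int main(void)
{
	struct sched_avg_model sa = { .load_avg = 1024, .period_contrib = 0 };

	model_update(&sa, 700);	/* no boundary crossed: load_avg stays 1024 */
	printf("after  700us: %lu\n", sa.load_avg);

	model_update(&sa, 700);	/* crosses one boundary: decays exactly once */
	printf("after 1400us: %lu\n", sa.load_avg);

	return 0;
}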

With 0905f04eb21f applied, it's less likely (though not impossible)
that ->period_contrib will be zeroed, so we usually end up with some
residual load in cfs_rq->runnable_load_avg on dequeue, and hence,

	cfs_rq->runnable_load_avg > se->avg.load_avg

even if 'se' is the only task on the runqueue.
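
To put an illustrative number on it (the figures below are made up,
not taken from a trace): if the se contribution happened to be
decayed across a couple more period boundaries than the cfs_rq sums,
the clamped subtraction at dequeue leaves a small positive remainder
behind on an otherwise empty runqueue.

#include <stdio.h>

int main(void)
{
	long runnable_load_avg = 1024;	/* accumulated at enqueue */
	double se_load_avg = 1024.0;

	/*
	 * Pretend the se average crossed two more 1024us period
	 * boundaries than the cfs_rq sums did (different
	 * last_update_time / period_contrib), with y ~= 0.9786
	 * per period.
	 */
	se_load_avg *= 0.978572 * 0.978572;

	/* The clamped subtraction done at dequeue. */
	long residue = runnable_load_avg - (long)se_load_avg;
	if (residue < 0)
		residue = 0;

	/*
	 * Prints a small positive residue (~44) that only further
	 * PELT decay will drain if nothing else is enqueued.
	 */
	printf("residual runnable_load_avg = %ld\n", residue);
	return 0;
}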

FYI, below is my quick and dirty hack that restored hackbench
performance for the few machines I checked. I didn't try schbench with
it.

---

From 4e9856ea3dc56e356195ca035dab7302754ce59b Mon Sep 17 00:00:00 2001
From: Matt Fleming <matt@...eblueprint.co.uk>
Date: Thu, 9 Jun 2016 19:48:14 +0100
Subject: [PATCH] sched/fair: Reset ::runnable_load_avg when dequeueing last
 entity

The task and runqueue load averages, maintained in p->se.avg.load_avg
and cfs_rq->runnable_load_avg respectively, can decay at different
wall clock rates, which means that enqueueing and then dequeueing a
task on an otherwise empty runqueue doesn't always leave
::runnable_load_avg with its initial value.

This can lead to a situation where cfs_rq->runnable_load_avg has a
non-zero value even though there are no runnable entities on the
runqueue. Assuming no entity is enqueued on this runqueue for some
time, this residual load average will decay gradually as the load
averages are updated.

But we can optimise the special case of dequeueing the last entity by
resetting ::runnable_load_avg early, which gives a performance
improvement to workloads that trigger the load balancer, such as
fork-heavy applications when SD_BALANCE_FORK is set, because it gives
the balancer a more up-to-date view of how busy the cpu is.

Signed-off-by: Matt Fleming <matt@...eblueprint.co.uk>
---
 kernel/sched/fair.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6dd8bab010c..408ee90c7ea8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3007,10 +3007,20 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static inline void
 dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	unsigned long load_avg = 0;
+
 	update_load_avg(se, 1);
 
-	cfs_rq->runnable_load_avg =
-		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
+	/*
+	 * If we're about to dequeue the last runnable entity we can
+	 * reset the runnable load average to zero instead of waiting
+	 * for it to decay naturally. This gives the load balancer a
+	 * more timely and accurate view of how busy this cpu is.
+	 */
+	if (cfs_rq->nr_running > 1)
+		load_avg = max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
+
+	cfs_rq->runnable_load_avg = load_avg;
 	cfs_rq->runnable_load_sum =
 		max_t(s64,  cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
 }
-- 
2.7.3
