Date:	Mon, 19 Apr 2010 16:43:46 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:	Ingo Molnar <mingo@...e.hu>, Mike Galbraith <efault@....de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Greg Kroah-Hartman <greg@...ah.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Jarkko Nikula <jhnikula@...il.com>,
	Tony Lindgren <tony@...mide.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC patch] CFS fix place entity spread issue (v2)

On Mon, 2010-04-19 at 10:06 -0400, Mathieu Desnoyers wrote:

> place_entity(), when placing a sleeper on the runqueue, decrements its vruntime
> (derived from min_vruntime) by "thresh" (derived from the load calculation). So
> if we end up with a woken-up task that does not consume enough vruntime to pass
> the previous min_vruntime value, we effectively decrement min_vruntime.

No, we don't; the next time it comes along it will at worst use the exact
same min_vruntime, but not a lower one.
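
For reference, the sleeper placement Mathieu refers to looks roughly like
this in the current sched_fair.c (trimmed, from memory); note the final
max(): a re-woken task can at worst be placed back at its old vruntime,
never behind it:

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;

	/* sleeps up to a single latency don't count */
	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		/* halve the sleeper bonus for a gentler effect */
		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		vruntime -= thresh;
	}

	/* ensure we never gain time by being placed backwards */
	vruntime = max_vruntime(se->vruntime, vruntime);
	se->vruntime = vruntime;
}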

> The other thing I am currently looking into is that, without even considering
> the question of going lower than min_vruntime, given that a woken-up sleeper
> gets an extra share of vruntime (again because of place_entity), it might always
> run between -thresh and the previous min_vruntime, therefore "stalling" the
> min_vruntime value at low values while other threads push the maximum vruntime
> higher, therefore increasing the spread. I'm currently doing some tests trying
> to cope with this by consuming woken-up sleepers' vruntime more quickly for a
> vruntime duration of "thresh".

But these things will break the wakeup-preemption, you cannot change
place_entity() without also considering
check_preempt_wakeup()->wakeup_preempt_entity().
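
The check in question is roughly (again from memory, simplified):

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	/* the waking task is to the right of curr: no preemption */
	if (vdiff <= 0)
		return -1;

	/* only preempt if it is far enough left to beat the granularity */
	gran = wakeup_gran(curr, se);
	if (vdiff > gran)
		return 1;

	return 0;
}

so anything that changes how far left of min_vruntime place_entity() puts a
sleeper directly changes how often that sleeper preempts on wakeup.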

> I am also testing alternative ways to deal with the min_vruntime going backward

It doesn't go backwards!! Show me where min_vruntime actually decreases?
Sure, we can have entities that are left of min_vruntime, but that doesn't
mean min_vruntime decreases!
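
To be explicit, the update path only ever moves it forward; roughly the
current code (paraphrased):

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
	u64 vruntime = cfs_rq->min_vruntime;

	if (cfs_rq->curr)
		vruntime = cfs_rq->curr->vruntime;

	if (cfs_rq->rb_leftmost) {
		struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
						   struct sched_entity,
						   run_node);

		if (!cfs_rq->curr)
			vruntime = se->vruntime;
		else
			vruntime = min_vruntime(vruntime, se->vruntime);
	}

	/* min_vruntime is monotonic: only ever max()'ed forward */
	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
}

A leftmost entity below min_vruntime simply doesn't move it; it never pulls
it back.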

> problem, by keeping a "floor_vruntime", which ensures that place_entity() can
> never decrement min_vruntime below the floor value of its previous decrement.
> 
> Sorry for the missing changelog info, I was between planes this weekend.
> Hopefully my sleep-deprived explanation makes sense. ;)

None whatsoever ;-)


Now, theory says we should use the 0-lag point for task insertion, and I
have a patch that tracks that point, but it adds extra multiplications
to all the fast-paths, and I never really managed to make 0-lag
insertion work well with wakeup-preemption.


I've never really liked the sleeper-fairness stuff from a theoretical PoV,
but it does provide the 'interactive' behaviour lots of people like (OTOH it
also causes some over-scheduling for some loads, and, as you've seen, it
rips the spread apart).


I also tried an avg_runtime based wakeup-preemption, where a task that on
average runs shorter than the currently running task will preempt (within
the wakeup_gran limits), but that didn't work out either -- although it did
work for that one latency-tester app from Jens (IIRC) that started all this.
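
(A rough sketch of the idea only -- this is not the actual patch, and the
avg_runtime field is hypothetical here, e.g. an EWMA of recent slice
lengths:

static int
avg_runtime_preempt(struct sched_entity *curr, struct sched_entity *se)
{
	/* hypothetical: se->avg_runtime is not a mainline field */
	s64 gran = wakeup_gran(curr, se);

	/* a habitually shorter runner may preempt, within the usual granularity */
	if ((s64)se->avg_runtime + gran < (s64)curr->avg_runtime)
		return 1;

	return 0;
}

i.e. preempt only when the waking task's typical run is shorter than curr's
by more than the wakeup granularity.)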

---
Subject: sched: avg_vruntime
From: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Date: Fri, 24 Oct 2008 11:06:17 +0200

Renicing requires scaling the lag. Therefore we need a way to compute it.
Lag is defined as the difference between the service time received from the
ideal model and the actual scheduler.

The defining property of a fair scheduler is that the sum of all lags is zero;
this is trivially true for the ideal case, as there all lags are zero.

Therefore, the average of all virtual runtimes will be the point of zero lag.
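
Ignoring load weights for a moment and writing V for the zero-lag point,
each entity's lag is proportional to (V - v_{i}), so

  \Sum_{i}^{n} (V - v_{i}) = 0  =>  n V = \Sum_{i}^{n} v_{i}  =>  V = 1/n \Sum_{i}^{n} v_{i}

i.e. V is simply the average of the vruntimes.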

We cannot prove fairness for CFS because of sleeper fairness (without it we can).
However, since we can observe that it does converge to fairness in stable
operation, we can say the zero-lag point converges to the average.

We can't just take the average of vruntime, as it uses the full range of its
u64 and will wrap around. Instead we'll use the average of
(vruntime - min_vruntime):

\Sum_{i}^{n} 1/n (v_{i} - v) = 1/n (\Sum_{i}^{n} v_{i}) - v

By factoring out the 1/n (never storing it) we avoid rounding, which would
introduce an accumulating error.
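
(As a tiny userspace illustration of the relative-key trick -- unweighted,
and obviously not kernel code -- the sum of small keys stays far from
overflow even when the absolute clock is about to wrap:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t min_vruntime = UINT64_MAX - 1000;	/* absolute clock near wrap */
	uint64_t v[3] = { min_vruntime + 10, min_vruntime + 40, min_vruntime + 100 };
	int64_t sum_keys = 0;
	int i;

	for (i = 0; i < 3; i++)
		sum_keys += (int64_t)(v[i] - min_vruntime);	/* small relative keys */

	/* 1/n applied only at the end, when reading the average out */
	printf("zero-lag point = min_vruntime + %lld\n", (long long)(sum_keys / 3));
	return 0;
}

The patch below does the weighted version of the same thing, keeping the sum
relative to min_vruntime by adjusting it whenever min_vruntime advances.)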

Signed-off-by: Peter Zijlstra <a.p.zijlstra@...llo.nl>
LKML-Reference: <20081024091109.832347739@...llo.nl>
---
 kernel/sched.c       |    4 +++
 kernel/sched_debug.c |    3 ++
 kernel/sched_fair.c  |   60 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 66 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -349,6 +349,10 @@ struct cfs_rq {
 	struct load_weight load;
 	unsigned long nr_running;
 
+	long nr_queued;
+	long avg_load;
+	s64 avg_vruntime;
+
 	u64 exec_clock;
 	u64 min_vruntime;
 
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -202,6 +202,9 @@ void print_cfs_rq(struct seq_file *m, in
 			SPLIT_NS(spread0));
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
+			SPLIT_NS(avg_vruntime(cfs_rq)));
+
 
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -301,6 +301,60 @@ static inline s64 entity_key(struct cfs_
 	return se->vruntime - cfs_rq->min_vruntime;
 }
 
+static void
+avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_rq, se);
+	cfs_rq->avg_load += se->load.weight;
+	cfs_rq->avg_vruntime += key * se->load.weight;
+	cfs_rq->nr_queued++;
+}
+
+static void
+avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_rq, se);
+	cfs_rq->avg_load -= se->load.weight;
+	cfs_rq->avg_vruntime -= key * se->load.weight;
+	cfs_rq->nr_queued--;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+	cfs_rq->avg_vruntime -= cfs_rq->nr_queued * cfs_rq->avg_load * delta;
+}
+
+static u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	s64 avg = cfs_rq->avg_vruntime;
+	long nr_queued = cfs_rq->nr_queued;
+	long load = cfs_rq->avg_load;
+
+	if (cfs_rq->curr) {
+		nr_queued++;
+		avg += entity_key(cfs_rq, cfs_rq->curr);
+		load += cfs_rq->curr->load.weight;
+	}
+
+	if (nr_queued)
+		avg = div_s64(avg, nr_queued);
+
+	return cfs_rq->min_vruntime + avg;
+}
+
+static void __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+{
+	/*
+	 * open coded max_vruntime() to allow updating avg_vruntime
+	 */
+	s64 delta = (s64)(vruntime - cfs_rq->min_vruntime);
+	if (delta > 0) {
+		avg_vruntime_update(cfs_rq, delta);
+		cfs_rq->min_vruntime = vruntime;
+	}
+}
+
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
 	u64 vruntime = cfs_rq->min_vruntime;
@@ -319,7 +373,7 @@ static void update_min_vruntime(struct c
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+	__update_min_vruntime(cfs_rq, vruntime);
 }
 
 /*
@@ -333,6 +387,8 @@ static void __enqueue_entity(struct cfs_
 	s64 key = entity_key(cfs_rq, se);
 	int leftmost = 1;
 
+	avg_vruntime_add(cfs_rq, se);
+
 	/*
 	 * Find the right place in the rbtree:
 	 */
@@ -372,6 +428,8 @@ static void __dequeue_entity(struct cfs_
 	}
 
 	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+
+	avg_vruntime_sub(cfs_rq, se);
 }
 
 static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
