Date:	Tue, 27 May 2008 10:20:48 +0200
From:	Mike Galbraith <efault@....de>
To:	Greg Smith <gsmith@...gsmith.com>
Cc:	Ingo Molnar <mingo@...e.hu>, Peter Zijlstra <peterz@...radead.org>,
	Dhaval Giani <dhaval@...ux.vnet.ibm.com>,
	lkml <linux-kernel@...r.kernel.org>,
	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>
Subject: Re: [patch] Re: PostgreSQL pgbench performance regression in
	2.6.23+


On Tue, 2008-05-27 at 07:59 +0200, Mike Galbraith wrote:
> On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:
> 
> > 2) Mike suggested a patch to 2.6.25 in this thread that backports the
> > feature for disabling SCHED_FEAT_SYNC_WAKEUPS.  Would it be reasonable
> > to push that into 2.6.25.5?  It's clearly quite useful for this load
> > and therefore possibly others.
> 
> If the patch above flies, imho, that should be folded into the backport
> to stable.  The heart of the patch is a bugfix, so definitely stable
> material.  Whether to enable the debugging/tweaking knobs as well is a
> different question.  (I would do it, but I ain't the maintainer;)

Hm, pgbench's extreme dislike of preemption, and the starvation testcase
I sent earlier having an absolute requirement of preemption, kinda argue
that some knobs and dials should be per task or task group (or, or... or
the scheduler should be all-knowing, all-seeing;)
 
clients  2.6.25.4-feat=45  2.6.25.4-feat=111  2.6.25.4-feat=47
      1      11385.471887       10292.721924       9551.157672
      2      16709.515434       15540.399522      16283.968970
      3      25456.658841       20187.320016      24562.735943
      4      24453.435157       24975.037450      23391.583053
      5      25504.302958       23102.131056      23671.860667
      6      27076.359200       24688.791507      25947.592071
      8      31758.200682       29462.639752      29700.144372
     10      32190.081142       30428.413809      27439.441838
     15      31175.074906       11097.668025      20344.284129
     20      30513.974332       10742.166624      19256.695409
     30      28307.399275       10233.708047      17535.423344
     40      26720.463867       10037.856773      16104.895695
     50      24899.945793        9907.624283      15768.746911
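
(For the record: the first column is the pgbench client count, and the feat
values are the numeric sysctl_sched_features bitmask, presumably poked via
the sched_features sysctl under CONFIG_SCHED_DEBUG.  Decoding them against
the bits in the patch below -- the first two bit names come from 2.6.25's
sched.c, above the visible hunk context -- here is a throwaway userspace
sketch, not kernel code:

	#include <stdio.h>

	/* Feature bits as laid out in the patch below; NEW_FAIR_SLEEPERS
	 * and WAKEUP_PREEMPT sit above the visible hunk in 2.6.25. */
	enum {
		NEW_FAIR_SLEEPERS = 1,  WAKEUP_PREEMPT = 2,  START_DEBIT = 4,
		HRTICK = 8,  DOUBLE_TICK = 16,  AFFINE_WAKEUPS = 32,
		SYNC_WAKEUPS = 64,
	};

	int main(void)
	{
		int all = NEW_FAIR_SLEEPERS | WAKEUP_PREEMPT | START_DEBIT |
			  HRTICK | AFFINE_WAKEUPS | SYNC_WAKEUPS;

		printf("%d\n", all);                    /* 111: patched default  */
		printf("%d\n", all & ~SYNC_WAKEUPS);    /*  47: sync wakeups off */
		printf("%d\n", all & ~(SYNC_WAKEUPS |
				       WAKEUP_PREEMPT)); /* 45: + preemption off */
		return 0;
	}

So feat=47 is the default with sync wakeups disabled, and feat=45, the winner
at high client counts, additionally disables wakeup preemption -- consistent
with pgbench's dislike of preemption noted above.)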

Anyway, if patchlet flies, and Ingo concurs, I'll submit the below.

Prevent short-running wakers of short-running threads from overloading a
single CPU via wakeup affinity, and provide affinity-related debug/tuning
options.

Signed-off-by: Mike Galbraith <efault@....de>

 kernel/sched.c      |    9 ++++++++-
 kernel/sched_fair.c |   25 ++++++++++++++-----------
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1e4596c..d6d70a8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -596,6 +596,8 @@ enum {
 	SCHED_FEAT_START_DEBIT		= 4,
 	SCHED_FEAT_HRTICK		= 8,
 	SCHED_FEAT_DOUBLE_TICK		= 16,
+	SCHED_FEAT_AFFINE_WAKEUPS	= 32,
+	SCHED_FEAT_SYNC_WAKEUPS		= 64,
 };
 
 const_debug unsigned int sysctl_sched_features =
@@ -603,7 +605,9 @@ const_debug unsigned int sysctl_sched_features =
 		SCHED_FEAT_WAKEUP_PREEMPT	* 1 |
 		SCHED_FEAT_START_DEBIT		* 1 |
 		SCHED_FEAT_HRTICK		* 1 |
-		SCHED_FEAT_DOUBLE_TICK		* 0;
+		SCHED_FEAT_DOUBLE_TICK		* 0 |
+		SCHED_FEAT_AFFINE_WAKEUPS	* 1 |
+		SCHED_FEAT_SYNC_WAKEUPS		* 1;
 
 #define sched_feat(x) (sysctl_sched_features & SCHED_FEAT_##x)
 
@@ -1902,6 +1906,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 	long old_state;
 	struct rq *rq;
 
+	if (!sched_feat(SYNC_WAKEUPS))
+		sync = 0;
+
 	smp_wmb();
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0080968..bf2defe 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -992,16 +992,27 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
 	struct task_struct *curr = this_rq->curr;
 	unsigned long tl = this_load;
 	unsigned long tl_per_task;
+	int bad_imbalance;
 
-	if (!(this_sd->flags & SD_WAKE_AFFINE))
+	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
 		return 0;
 
 	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync && tl)
+		tl -= curr->se.load.weight;
+
+	bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class) {
 		if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
 				p->se.avg_overlap < sysctl_sched_migration_cost)
 			return 1;
@@ -1010,16 +1021,8 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
 	schedstat_inc(p, se.nr_wakeups_affine_attempts);
 	tl_per_task = cpu_avg_load_per_task(this_cpu);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current CPU:
-	 */
-	if (sync)
-		tl -= current->se.load.weight;
-
 	if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
-			100*(tl + p->se.load.weight) <= imbalance*load) {
+			!bad_imbalance) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
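
(For anyone checking the arithmetic: the patch hoists the old acceptance
test 100*(tl + p->se.load.weight) <= imbalance*load into a bad_imbalance
predicate, computed once after the sync waker's weight has been subtracted
from tl -- using this_rq's curr rather than current -- and uses it to veto
the avg_overlap fast path as well.  A minimal standalone sketch with
made-up numbers, not kernel code:

	#include <stdio.h>

	int main(void)
	{
		/* All values hypothetical, in load-weight units (1024 = nice 0). */
		unsigned long tl        = 1024; /* this_cpu load, sync waker subtracted  */
		unsigned long p_weight  = 1024; /* waking task's load weight             */
		unsigned long load      = 2048; /* load on the task's previous cpu       */
		unsigned long imbalance = 125;  /* stand-in for the domain imbalance pct */

		/*
		 * Pulling p to this_cpu is a "bad imbalance" when this_cpu
		 * would end up more than imbalance% as loaded as prev_cpu:
		 */
		int bad_imbalance = 100 * (tl + p_weight) > imbalance * load;

		printf("bad_imbalance = %d\n", bad_imbalance); /* 0 here: affine ok */
		return 0;
	}

With these numbers 100*2048 = 204800 <= 125*2048 = 256000, so the affine
wakeup is still allowed; double either tl or p's weight and it is refused.)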




