Date:	Tue, 27 May 2008 07:59:10 +0200
From:	Mike Galbraith <efault@....de>
To:	Greg Smith <gsmith@...gsmith.com>
Cc:	Ingo Molnar <mingo@...e.hu>, Peter Zijlstra <peterz@...radead.org>,
	Dhaval Giani <dhaval@...ux.vnet.ibm.com>,
	lkml <linux-kernel@...r.kernel.org>,
	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>
Subject: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+


On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:

> I did again get useful results here with the stock 2.6.26.git kernel and 
> default parameters using Peter's small patch to adjust se.waker.
> 
> What I found most interesting was how the results changed when I set 
> /proc/sys/kernel/sched_features = 0, without doing anything with batch 
> mode.  The default for that is 1101111111=895. What I then did was run 
> through setting each of those bits off one by one to see which feature(s) 
> were getting in the way here.  The two that mattered a lot were 895-32=863 
> (no SCHED_FEAT_SYNC_WAKEUPS ) and 895-2=893 (no 
> SCHED_FEAT_WAKEUP_PREEMPT).  Combining those two but keeping the rest of 
> the features on (895-32-2=861) actually gave the best result I've ever 
> seen here, better than with all the features disabled.  Tossing out all 
> the tests I did that didn't show anything useful, here's my table of the 
> interesting results.
> 
> Clients	.22.19	.26.git	waker	f=0	f=893	f=863	f=861
> 1	7660	11043	11041	9214	11204	9232	9433
> 2	17798	11452	16306	16916	11165	16686	16097
> 3	29612	13231	18476	24202	11348	26210	26906
> 4	25584	13053	17942	26639	11331	25094	25679
> 6	25295	12263	18472	28918	11761	30525	33297
> 8	24344	11748	19109	32730	12190	31775	35912
> 10	23963	11612	19537	31688	12331	29644	36215
> 15	23026	11414	19518	33209	13050	28651	36452
> 20	22549	11332	19029	32583	13544	25776	35707
> 30	22074	10743	18884	32447	14191	21772	33501
> 40	21495	10406	18609	31704	11017	20600	32743
> 50	20051	10534	17478	29483	14683	19949	31047
> 60	18690	9816	17467	28614	14817	18681	29576
> 
> Note that compared to earlier test runs, I replaced the 5 client case with 
> a 60 client one to get more data on the top end.  I also wouldn't pay too 
> much attention to the single client case; that one really bounces around a 
> lot on most of the kernel revs, even with me doing 5 runs and using the 
> median.
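
For reference, a minimal sketch of the feature-bit arithmetic described
above (not from the original mail): the bit values for SYNC_WAKEUPS (32)
and WAKEUP_PREEMPT (2) are inferred from the quoted numbers rather than
taken from kernel headers, and it assumes a kernel of that era where
/proc/sys/kernel/sched_features is a plain numeric bitmask.

#include <stdio.h>

#define FEAT_WAKEUP_PREEMPT	2u	/* the "2" bit in the quoted text  */
#define FEAT_SYNC_WAKEUPS	32u	/* the "32" bit in the quoted text */

int main(void)
{
	unsigned int def = 895;	/* binary 1101111111, the quoted default */

	printf("no SYNC_WAKEUPS:   %u\n", def & ~FEAT_SYNC_WAKEUPS);   /* 863 */
	printf("no WAKEUP_PREEMPT: %u\n", def & ~FEAT_WAKEUP_PREEMPT); /* 893 */
	printf("neither:           %u\n",
	       def & ~(FEAT_SYNC_WAKEUPS | FEAT_WAKEUP_PREEMPT));      /* 861 */

	/* to apply, e.g.:  echo 861 > /proc/sys/kernel/sched_features */
	return 0;
}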

Care to give the below a whirl?  It fixes the over-enthusiastic affinity
bug in a less restrictive way.  It doesn't attempt to address the needs
of any particular load though; that needs more thought (tricky issue).

With default features, I get the below.

2.6.26-smp x86_64 (clients / tps)
1 10121.600913
2 14360.229517
3 17048.770371
4 18748.777814
5 22086.493358
6 24913.416187
8 27976.026783
10 29346.503261
15 29157.239431
20 28392.257204
30 26590.199787
40 24422.481578
50 23305.981434

(I can get a bit more by disabling HR_TICK, along with a dinky patchlet
to reduce overhead when it's disabled.  Bottom line is that the bug is
fixed though; maximizing performance is a separate issue imho)

Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up the disconnected debug option.

Signed-off-by: Mike Galbraith <efault@....de>

 kernel/sched_fair.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
 	struct task_struct *curr = this_rq->curr;
 	unsigned long tl = this_load;
 	unsigned long tl_per_task;
+	int bad_imbalance;
 
-	if (!(this_sd->flags & SD_WAKE_AFFINE))
+	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
 		return 0;
 
 	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync && tl)
+		tl -= curr->se.load.weight;
+
+	bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class) {
 		if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
 				p->se.avg_overlap < sysctl_sched_migration_cost)
 			return 1;
@@ -1075,16 +1086,8 @@ wake_affine(struct rq *rq, struct sched_
 	schedstat_inc(p, se.nr_wakeups_affine_attempts);
 	tl_per_task = cpu_avg_load_per_task(this_cpu);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current CPU:
-	 */
-	if (sync)
-		tl -= current->se.load.weight;
-
 	if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
-			100*(tl + p->se.load.weight) <= imbalance*load) {
+			!bad_imbalance) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
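
(For readers skimming the diff: the sketch below is a paraphrase of the
patched wake_affine() decision flow, reconstructed only from the hunks
above; the argument names and helper flags are stand-ins, not the real
2.6.26 signatures.)

static int wake_affine_sketch(unsigned long this_load, unsigned long load,
			      unsigned long waker_weight,
			      unsigned long wakee_weight,
			      unsigned int imbalance, int sync,
			      int affine_ok, int both_short_running,
			      int cheap_by_load_balance)
{
	unsigned long tl = this_load;
	int bad_imbalance;

	if (!affine_ok)		/* SD_WAKE_AFFINE set and AFFINE_WAKEUPS enabled */
		return 0;

	if (sync && tl)		/* discount the waker for sync wakeups */
		tl -= waker_weight;

	/* would pulling the wakee here create a real imbalance? */
	bad_imbalance = 100 * (tl + wakee_weight) > imbalance * load;

	/* fast path: waker about to sleep, both tasks short-running */
	if (sync && !bad_imbalance && both_short_running)
		return 1;

	/* ordinary affine decision, now also gated on !bad_imbalance */
	if (cheap_by_load_balance || !bad_imbalance)
		return 1;

	return 0;
}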


> Some still open questions after this long investigation that I'd like to 
> know the answers to are:
> 
> 1) Why are my 2.6.26.git results so dramatically worse than the ones Mike 
> posted?  I'm not sure what was different about his test setup here.  The 
> 2.6.22 results are pretty similar, and the fully tuned ones as well, so 
> the big difference on that column bugs me.

Because those were git _with_ Peter's patch.  I was looking at the
difference between a non-broken 2.6.26 with and without minimizing
preemption, to match the way I tested 2.6.22.

> 2) Mike suggested a patch to 2.6.25 in this thread that backports the 
> feature for disabling SCHED_FEAT_SYNC_WAKEUPS.  Would it be reasonable to 
> push that into 2.6.25.5?  It's clearly quite useful for this load and 
> therefore possibly others.

If the patch above flies, imho, that should be folded into the backport
to stable.  The heart of the patch is a bugfix, so definitely stable
material.  Whether to enable the debugging/tweaking knobs as well is a
different question.  (I would do it, but I ain't the maintainer;)

> 3) Peter's se.waker patch is a big step forward on this workload without 
> any tuning, closing a significant amount of the gap between the default 
> setup and what I get with the two troublesome features turned off 
> altogether.  What issues might there be with pushing that into the stock 
> {2.6.25|2.6.26} kernel?

(It doesn't fully address the 1:N needs; that needs much more thought.)

> 4) What known workloads are there that suffer if SCHED_FEAT_SYNC_WAKEUPS 
> and SCHED_FEAT_WAKEUP_PREEMPT are disabled?  I'd think that any attempt to 
> tune those code paths would need my case for "works better when 
> SYNC/PREEMPT wakeups disabled" as well as a case that works worse to 
> balance modifications against.

I suspect very many, certainly any load where latency is of major
importance.  Peak performance of the mysql+oltp benchmark is definitely
hurt by your settings.  For a really good demonstration of how important
wakeup preemption can be, try running the attached testcase (distilled
from a real-world load and posted to lkml a few years ago) with wakeup
preemption disabled and nailed to one cpu.
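
The attachment itself isn't reproduced here, so purely as an
illustration (a hypothetical sketch, not the actual starve.c): the kind
of load that shows the effect is a pinned pipe ping-pong competing with
a pure CPU hog on the same cpu; without wakeup preemption each woken
partner has to wait out the hog, and round-trip latency explodes.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

static void pin_to_cpu0(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
}

int main(void)
{
	int ab[2], ba[2], i, iters = 10000;
	char c = 0;
	pid_t echo_pid, hog_pid;
	struct timeval t0, t1;

	pin_to_cpu0();			/* children inherit the affinity mask */
	if (pipe(ab) || pipe(ba))
		exit(1);

	echo_pid = fork();
	if (echo_pid == 0) {		/* partner: echo every byte straight back */
		while (read(ab[0], &c, 1) == 1)
			write(ba[1], &c, 1);
		_exit(0);
	}

	hog_pid = fork();
	if (hog_pid == 0)		/* pure CPU hog sharing the same cpu */
		for (;;)
			;

	gettimeofday(&t0, NULL);
	for (i = 0; i < iters; i++) {	/* each leg of the ping-pong needs a wakeup */
		write(ab[1], &c, 1);
		read(ba[0], &c, 1);
	}
	gettimeofday(&t1, NULL);

	printf("avg round trip: %.1f usec\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec)) / iters);

	kill(echo_pid, SIGTERM);
	kill(hog_pid, SIGTERM);
	return 0;
}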

> 5) Once (4) has identified some test cases, what else might be done to 
> make the default behavior better without killing the situations it's 
> intended for?

See patch :)

	-Mike

Attachment: "starve.c" (text/x-csrc, 716 bytes)
