Message-Id: <1211867950.5505.47.camel@marge.simson.net>
Date: Tue, 27 May 2008 07:59:10 +0200
From: Mike Galbraith <efault@....de>
To: Greg Smith <gsmith@...gsmith.com>
Cc: Ingo Molnar <mingo@...e.hu>, Peter Zijlstra <peterz@...radead.org>,
Dhaval Giani <dhaval@...ux.vnet.ibm.com>,
lkml <linux-kernel@...r.kernel.org>,
Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>
Subject: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:
> I did again get useful results here with the stock 2.6.26.git kernel and
> default parameters using Peter's small patch to adjust se.waker.
>
> What I found most interesting was how the results changed when I set
> /proc/sys/kernel/sched_features = 0, without doing anything with batch
> mode. The default for that is 1101111111=895. What I then did was run
> through setting each of those bits off one by one to see which feature(s)
> were getting in the way here. The two that mattered a lot were 895-32=863
> (no SCHED_FEAT_SYNC_WAKEUPS) and 895-2=893 (no
> SCHED_FEAT_WAKEUP_PREEMPT). Combining those two but keeping the rest of
> the features on (895-32-2=861) actually gave the best result I've ever
> seen here, better than with all the features disabled. Tossing out all
> the tests I did that didn't show anything useful, here's my table of the
> interesting results.
>
> Clients  .22.19  .26.git  waker    f=0  f=893  f=863  f=861
>       1    7660    11043  11041   9214  11204   9232   9433
>       2   17798    11452  16306  16916  11165  16686  16097
>       3   29612    13231  18476  24202  11348  26210  26906
>       4   25584    13053  17942  26639  11331  25094  25679
>       6   25295    12263  18472  28918  11761  30525  33297
>       8   24344    11748  19109  32730  12190  31775  35912
>      10   23963    11612  19537  31688  12331  29644  36215
>      15   23026    11414  19518  33209  13050  28651  36452
>      20   22549    11332  19029  32583  13544  25776  35707
>      30   22074    10743  18884  32447  14191  21772  33501
>      40   21495    10406  18609  31704  11017  20600  32743
>      50   20051    10534  17478  29483  14683  19949  31047
>      60   18690     9816  17467  28614  14817  18681  29576
>
> Note that compared to earlier test runs, I replaced the 5 client case with
> a 60 client one to get more data on the top end. I also wouldn't pay too
> much attention to the single client case; that one really bounces around a
> lot on most of the kernel revs, even with me doing 5 runs and using the
> median.
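(Aside, a minimal userspace sketch of the feature-mask arithmetic Greg
describes above. The two bit values are taken from the text itself and
match 2.6.26's kernel/sched.c; everything else here is just
illustration:)

	#include <stdio.h>

	#define SCHED_FEAT_WAKEUP_PREEMPT	 2	/* bit named in the text */
	#define SCHED_FEAT_SYNC_WAKEUPS		32	/* bit named in the text */

	int main(void)
	{
		unsigned int def = 895;	/* default mask, 1101111111 binary */

		/* clearing one bit at a time reproduces the f= columns above */
		printf("f=863 (no SYNC_WAKEUPS):   %u\n",
		       def & ~SCHED_FEAT_SYNC_WAKEUPS);
		printf("f=893 (no WAKEUP_PREEMPT): %u\n",
		       def & ~SCHED_FEAT_WAKEUP_PREEMPT);
		printf("f=861 (neither):           %u\n",
		       def & ~(SCHED_FEAT_SYNC_WAKEUPS | SCHED_FEAT_WAKEUP_PREEMPT));
		return 0;
	}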
Care to give the below a whirl? It fixes the over-enthusiastic affinity
bug in a less restrictive way. It doesn't attempt to address the needs
of any particular load though; that needs more thought (tricky issue).
With default features, I get the below.
2.6.26-smp x86_64

Clients  TPS (pgbench)
      1  10121.600913
      2  14360.229517
      3  17048.770371
      4  18748.777814
      5  22086.493358
      6  24913.416187
      8  27976.026783
     10  29346.503261
     15  29157.239431
     20  28392.257204
     30  26590.199787
     40  24422.481578
     50  23305.981434
(I can get a bit more by disabling HR_TICK along with a dinky patchlet
to reduce overhead when it's disabled. Bottom line is that the bug is
fixed though; maximizing performance is a separate issue imho.)
Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.
Signed-off-by: Mike Galbraith <efault@....de>
kernel/sched_fair.c | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)
Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
 	struct task_struct *curr = this_rq->curr;
 	unsigned long tl = this_load;
 	unsigned long tl_per_task;
+	int bad_imbalance;
 
-	if (!(this_sd->flags & SD_WAKE_AFFINE))
+	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
 		return 0;
 
 	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync && tl)
+		tl -= curr->se.load.weight;
+
+	bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class) {
 		if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
 		    p->se.avg_overlap < sysctl_sched_migration_cost)
 			return 1;
@@ -1075,16 +1086,8 @@ wake_affine(struct rq *rq, struct sched_
 	schedstat_inc(p, se.nr_wakeups_affine_attempts);
 	tl_per_task = cpu_avg_load_per_task(this_cpu);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current CPU:
-	 */
-	if (sync)
-		tl -= current->se.load.weight;
-
 	if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
-	    100*(tl + p->se.load.weight) <= imbalance*load) {
+	    !bad_imbalance) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
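(To make the new condition concrete, here's a toy recomputation of the
bad_imbalance test outside the kernel. The nice-0 weight of 1024 and
the imbalance factor of 112 -- the caller's usual
100 + (imbalance_pct - 100) / 2 with the default imbalance_pct of 125 --
are assumptions for illustration only:)

	#include <stdio.h>

	#define NICE_0_LOAD	1024UL	/* assumed nice-0 task weight */
	#define IMBALANCE	112UL	/* assumed caller-supplied factor */

	/* same expression the patch computes in wake_affine() */
	static int bad_imbalance(unsigned long tl, unsigned long wakee_weight,
				 unsigned long load)
	{
		return 100 * (tl + wakee_weight) > IMBALANCE * load;
	}

	int main(void)
	{
		/* sync wakeup already subtracted the nice-0 waker (tl == 0),
		 * one nice-0 task on the previous CPU: affine wakeup is fine */
		printf("idle this_cpu:   %d\n",
		       bad_imbalance(0, NICE_0_LOAD, NICE_0_LOAD));	/* 0 */

		/* two more nice-0 tasks already queued here: don't pile on */
		printf("loaded this_cpu: %d\n",
		       bad_imbalance(2 * NICE_0_LOAD, NICE_0_LOAD,
				     NICE_0_LOAD));			/* 1 */
		return 0;
	}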
> Some still open questions after this long investigation that I'd like to
> know the answers to are:
>
> 1) Why are my 2.6.26.git results so dramatically worse than the ones Mike
> posted? I'm not sure what was different about his test setup here. The
> 2.6.22 results are pretty similar, and the fully tuned ones as well, so
> the big difference on that column bugs me.
Because those were git _with_ Peter's patch. I was looking at the
difference between a non-broken 2.6.26 with and without minimizing
preemption, to match the way I tested 2.6.22.
> 2) Mike suggested a patch to 2.6.25 in this thread that backports the
> feature for disabling SCHED_FEAT_SYNC_WAKEUPS. Would it be reasonable to
> push that into 2.6.25.5? It's clearly quite useful for this load and
> therefore possibly others.
If the patch above flies, imho, that should be folded into the backport
to stable. The heart of the patch is a bugfix, so definitely stable
material. Whether to enable the debugging/tweaking knobs as well is a
different question. (I would do it, but I ain't the maintainer;)
> 3) Peter's se.waker patch is a big step forward on this workload without
> any tuning, closing a significant amount of the gap between the default
> setup and what I get with the two troublesome features turned off
> altogether. What issues might there be with pushing that into the stock
> {2.6.25|2.6.26} kernel?
(it doesn't fully address the 1:N needs; that needs much more thought)
> 4) What known workloads are there that suffer if SCHED_FEAT_SYNC_WAKEUPS
> and SCHED_FEAT_WAKEUP_PREEMPT are disabled? I'd think that any attempt to
> tune those code paths would need my case for "works better when
> SYNC/PREEMPT wakeups disabled" as well as a case that works worse to
> balance modifications against.
I suspect very many, certainly any load where latency is of major
importance. Peak performance of the mysql+oltp benchmark, for one, is
definitely hurt by your settings. For a really good demonstration of
how important wakeup preemption can be, run the attached testcase
(distilled from a real-world load and posted to lkml a few years ago)
without wakeup preemption and nailed to one cpu.
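(If you want to reproduce the "nailed to one cpu" part without digging
up taskset, a sketch using sched_setaffinity; pinning to CPU 0 is an
arbitrary choice:)

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(0, &set);	/* restrict ourselves to CPU 0 */
		if (sched_setaffinity(0, sizeof(set), &set)) {
			perror("sched_setaffinity");
			return 1;
		}
		/* ... run or exec the testcase workload here ... */
		return 0;
	}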
> 5) Once (4) has identified some tests cases, what else might be done to
> make the default behavior better without killing the situations it's
> intended for?
See patch :)
-Mike
[Attachment: starve.c (text/x-csrc, 716 bytes)]