Message-ID: <20121112184833.GA17503@gmail.com>
Date: Mon, 12 Nov 2012 19:48:33 +0100
From: Ingo Molnar <mingo@...nel.org>
To: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Paul Turner <pjt@...gle.com>,
Lee Schermerhorn <Lee.Schermerhorn@...com>,
Christoph Lameter <cl@...ux.com>,
Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Benchmark results: "Enhanced NUMA scheduling with adaptive affinity"

* Peter Zijlstra <a.p.zijlstra@...llo.nl> wrote:
> Hi,
>
> This series implements an improved version of NUMA scheduling,
> based on the review and testing feedback we got.
>
> [...]
>
> This new scheduler code is then able to group tasks that are
> "memory related" via their memory access patterns together: in
> the NUMA context moving them on the same node if possible, and
> spreading them amongst nodes if they use private memory.

Here are some preliminary performance figures, comparing the
vanilla kernel against the CONFIG_SCHED_NUMA=y kernel.

Java SPEC benchmark, running on a 4 node, 64 GB, 32-way server
system (higher numbers are better):

  v3.7-vanilla:   run #1:   475630
                  run #2:   538271
                  run #3:   533888
                  run #4:   431525
                  ----------------------------------
                  avg:      494828 transactions/sec

  v3.7-NUMA:      run #1:   626692
                  run #2:   622069
                  run #3:   630335
                  run #4:   629817
                  ----------------------------------
                  avg:      627228 transactions/sec  [ +26.7% ]

Beyond the +26.7% improvement in throughput, the standard
deviation of the results is also much lower with NUMA scheduling
enabled - by about an order of magnitude.

[ That is probably because memory and task placement is more
  balanced with NUMA scheduling enabled - while with the vanilla
  kernel the initial placement of the working set determines the
  final performance figure. ]

I've also tested Andrea's 'autonumabench' benchmark suite
against vanilla and the NUMA kernel, because Mel reported that
the CONFIG_SCHED_NUMA=y code regressed. It does not regress
anymore:

#
# NUMA01
#

  perf stat --null --repeat 3 ./numa01

  v3.7-vanilla:    340.3 seconds           ( +/- 0.31% )
  v3.7-NUMA:       216.9 seconds [ +56% ]  ( +/- 8.32% )
                   -------------------------------------
  v3.7-HARD_BIND:  166.6 seconds

Here the new NUMA code is faster than vanilla by 56% - that is
because with the vanilla kernel all memory is allocated on
node0, overloading that node's memory bandwidth.

[ Standard deviation on the vanilla kernel is low, because the
  autonuma test causes close to the worst-case placement for the
  vanilla kernel - and there's not much room to deviate away
  from the worst-case. Despite that, the stddev of the NUMA
  kernel's results seems a tad high, suggesting further room
  for improvement. ]
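
For reference, the HARD_BIND numbers come from running the test
with CPUs and memory explicitly bound to nodes. A minimal sketch
of that kind of hard binding via libnuma - node 0 is only an
example here, and the actual node layout the hard-bound runs use
may differ:

  /* Hard-binding sketch using libnuma (build with -lnuma).
   * Pins the calling task's CPUs and its memory allocations to
   * node 0 - purely illustrative, not the autonumabench code. */
  #include <numa.h>
  #include <stdio.h>

  int main(void)
  {
          if (numa_available() < 0) {
                  fprintf(stderr, "no NUMA support\n");
                  return 1;
          }

          numa_run_on_node(0);                            /* CPUs: node 0 only   */
          numa_set_membind(numa_parse_nodestring("0"));   /* memory: node 0 only */

          /* ... run the benchmark workload here ... */
          return 0;
  }

The same effect can be had from the command line via
numactl --cpunodebind=... --membind=... .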

#
# NUMA01_THREAD_ALLOC
#

  perf stat --null --repeat 3 ./numa01_THREAD_ALLOC

  v3.7-vanilla:    425.1  seconds           ( +/- 1.04% )
  v3.7-NUMA:       118.7  seconds [ +250% ] ( +/- 0.49% )
                   -------------------------------------
  v3.7-HARD_BIND:  200.56 seconds

Here the NUMA kernel was able to go beyond the (naive)
hard-binding result and achieved 3.5x the performance of the
vanilla kernel, with a low stddev.

#
# NUMA02
#

  perf stat --null --repeat 3 ./numa02

  v3.7-vanilla:    56.1 seconds           ( +/- 0.72% )
  v3.7-NUMA:       17.0 seconds [ +230% ] ( +/- 0.18% )
                   -------------------------------------
  v3.7-HARD_BIND:  14.9 seconds

Here the NUMA kernel runs the test much (3.3x) faster than the
vanilla kernel. The workload is able to converge very quickly
and approximate the hard-binding ideal number very closely. If
the runtime were a bit longer, it would approximate it even more
closely.
The relative standard deviation is also four times lower than
vanilla's (0.18% vs. 0.72%), suggesting stable NUMA convergence.

#
# NUMA02_SMT
#

  perf stat --null --repeat 3 ./numa02_SMT

  v3.7-vanilla:    56.1 seconds           ( +/- 0.42% )
  v3.7-NUMA:       17.3 seconds [ +220% ] ( +/- 0.88% )
                   -------------------------------------
  v3.7-HARD_BIND:  14.6 seconds

In this test too, the NUMA kernel outperforms the vanilla kernel
by a factor of 3.2x. It comes very close to the ideal
hard-binding convergence result. Standard deviation is a bit
high.

I have also created a new perf benchmarking and workload
generation tool: 'perf bench numa' (I'll post it later in a
separate reply).

Via 'perf bench numa' we can generate arbitrary process and
thread layouts, with arbitrary memory sharing arrangements
between them.

Here are various comparisons to the vanilla kernel (higher
numbers are better):

#
# 4 processes with 4 threads per process, sharing 4x 1GB of
# process-wide memory:
#
# perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 1024 -T 0
#

  v3.7-vanilla:  14.8 GB/sec
  v3.7-NUMA:     32.9 GB/sec  [ +122.3% ]

2.2 times faster.

#
# 4 processes with 4 threads per process, each thread with 1GB
# of thread-local memory:
#
# perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 0 -T 1024
#

  v3.7-vanilla:  17.0 GB/sec
  v3.7-NUMA:     36.3 GB/sec  [ +113.5% ]

2.1 times faster.
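
To make the difference between the two runs concrete, here is a
rough sketch of the two access patterns being generated
(assuming -P is per-process and -T per-thread memory in MB; this
is an illustration, not the perf bench code itself): with
-P 1024 all threads of a process stream over one shared 1 GB
buffer, while with -T 1024 every thread streams over its own
private 1 GB buffer:

  /* Illustrative only - not the perf bench implementation.
   * Build with: cc -O2 -pthread pattern.c */
  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>

  #define NR_THREADS    4
  #define BUF_SIZE      (1024UL * 1024 * 1024)    /* 1 GB, as in -P/-T 1024 */
  #define LOOPS         100                       /* as in -l 100 */

  static char *process_buf;       /* shared, "process-wide" memory (-P) */

  static void *worker(void *arg)
  {
          int per_thread = (long)arg;
          char *buf = per_thread ? malloc(BUF_SIZE) : process_buf;
          int i;

          for (i = 0; i < LOOPS; i++)
                  memset(buf, i, BUF_SIZE);       /* stream over the buffer */

          if (per_thread)
                  free(buf);
          return NULL;
  }

  int main(int argc, char **argv)
  {
          pthread_t tids[NR_THREADS];
          long per_thread = argc > 1;     /* any argument: per-thread mode */
          int i;

          if (!per_thread)
                  process_buf = malloc(BUF_SIZE);

          for (i = 0; i < NR_THREADS; i++)
                  pthread_create(&tids[i], NULL, worker, (void *)per_thread);
          for (i = 0; i < NR_THREADS; i++)
                  pthread_join(tids[i], NULL);

          free(process_buf);
          return 0;
  }

This is the shared-vs-private distinction the series description
above talks about: threads hammering one shared buffer want to
be grouped on the same node, while threads with private buffers
can be spread across nodes together with their memory.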

So it's a nice improvement all around. With this version the
regressions that Mel Gorman reported a week ago appear to be
fixed as well.

Thanks,

	Ingo

ps. If anyone is curious about further details, let me know.
    The base kernel I used for measurement was commit
    02743c9c03f1 + the 8 patches Peter sent out.