Message-Id: <1354305521-11583-1-git-send-email-mingo@kernel.org>
Date: Fri, 30 Nov 2012 20:58:31 +0100
From: Ingo Molnar <mingo@...nel.org>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Paul Turner <pjt@...gle.com>,
Lee Schermerhorn <Lee.Schermerhorn@...com>,
Christoph Lameter <cl@...ux.com>,
Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Johannes Weiner <hannes@...xchg.org>,
Hugh Dickins <hughd@...gle.com>
Subject: [PATCH 00/10] Latest numa/core release, v18
I'm pleased to announce the latest, -v18 numa/core release.
This release fixes regressions and improves NUMA performance.
It has the following main changes:
 - Introduce directed NUMA convergence, which is based on
   the 'task buddy' relation introduced in -v17, and which makes
   use of the new "task flipping" facility.
 - Add a "related task group" balancing notion to the scheduler,
   to be able to 'compress' and 'spread' NUMA workloads
   based on which tasks relate to each other via their
   working set (i.e. which tasks access the same memory areas).
 - Track the quality and strength of NUMA convergence and
   create a feedback loop with the scheduler:
    - use it to direct migrations
    - use it to slow down and speed up the rate of the
      NUMA hinting page faults (see the sketch below)
 - Turn 4K pte NUMA faults into effective hugepage ones
 - Refine the 'shared tasks' memory interleaving logic
 - Improve CONFIG_NUMA_BALANCING=y OOM behavior
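To illustrate the direction of that feedback loop, here is a minimal,
runnable sketch (plain userspace C, not the kernel code - the period
bounds and the linear mapping are illustrative assumptions): the
stronger a task's measured convergence, the longer the interval
between NUMA hinting faults:

  /*
   * Schematic model of the convergence feedback loop: map a
   * convergence measure (0% = fully spread, 100% = fully
   * converged) to a NUMA hinting fault scan period.
   */
  #include <stdio.h>

  #define SCAN_PERIOD_MIN_MS   100        /* scan fast while spread out */
  #define SCAN_PERIOD_MAX_MS  1600        /* scan slowly once converged */

  static int numa_scan_period_ms(int convergence_pct)
  {
          int span = SCAN_PERIOD_MAX_MS - SCAN_PERIOD_MIN_MS;

          /* Stronger convergence -> fewer hinting faults: */
          return SCAN_PERIOD_MIN_MS + span * convergence_pct / 100;
  }

  int main(void)
  {
          int c;

          for (c = 0; c <= 100; c += 25)
                  printf("convergence %3d%% -> scan period %4d ms\n",
                         c, numa_scan_period_ms(c));
          return 0;
  }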
One key practical area of improvement is the NUMA convergence
of "multiple JVM" style workloads.
As a recap, this was -v17 performance with 4x SPECjbb instances
on a 4-node system (32 CPUs, 4 instances, 8 warehouses each, 240
seconds runtime, +THP):
spec1.txt: throughput = 177460.44 SPECjbb2005 bops
spec2.txt: throughput = 176175.08 SPECjbb2005 bops
spec3.txt: throughput = 175053.91 SPECjbb2005 bops
spec4.txt: throughput = 171383.52 SPECjbb2005 bops
--------------------------
SUM: throughput = 700072.95 SPECjbb2005 bops
The new -v18 figures are:
spec1.txt: throughput = 191415.52 SPECjbb2005 bops
spec2.txt: throughput = 193481.96 SPECjbb2005 bops
spec3.txt: throughput = 192865.30 SPECjbb2005 bops
spec4.txt: throughput = 191627.40 SPECjbb2005 bops
--------------------------
SUM: throughput = 769390.18 SPECjbb2005 bops
That is 10% faster than -v17, 22% faster than mainline, and within
1% of the hard-binding results (where each JVM is explicitly
memory- and CPU-bound to a single node).
According to my measurements, the -v18 NUMA kernel is also faster
than AutoNUMA (+THP-fix):
spec1.txt: throughput = 184327.49 SPECjbb2005 bops
spec2.txt: throughput = 187508.83 SPECjbb2005 bops
spec3.txt: throughput = 186206.44 SPECjbb2005 bops
spec4.txt: throughput = 188739.22 SPECjbb2005 bops
--------------------------
SUM: throughput = 746781.98 SPECjbb2005 bops
Mainline has the following 4x JVM performance:
spec1.txt: throughput = 157839.25 SPECjbb2005 bops
spec2.txt: throughput = 156969.15 SPECjbb2005 bops
spec3.txt: throughput = 157571.59 SPECjbb2005 bops
spec4.txt: throughput = 157873.86 SPECjbb2005 bops
--------------------------
SUM: throughput = 630253.85 SPECjbb2005 bops
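( As a cross-check of the percentages quoted above, computed
  straight from the SUM lines:

     769390.18 / 700072.95 = 1.099  ->  ~10% over -v17
     769390.18 / 630253.85 = 1.221  ->  ~22% over mainline )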
Another key area of improvement is !THP (4K pages) performance.
Mainline 4x SPECjbb !THP JVM results:
spec1.txt: throughput = 128575.47 SPECjbb2005 bops
spec2.txt: throughput = 125767.24 SPECjbb2005 bops
spec3.txt: throughput = 130042.30 SPECjbb2005 bops
spec4.txt: throughput = 128155.32 SPECjbb2005 bops
--------------------------
SUM: throughput = 512540.33 SPECjbb2005 bops
numa/core -v18 4x SPECjbb JVM !THP results:
spec1.txt: throughput = 158023.05 SPECjbb2005 bops
spec2.txt: throughput = 156895.51 SPECjbb2005 bops
spec3.txt: throughput = 156158.11 SPECjbb2005 bops
spec4.txt: throughput = 157414.52 SPECjbb2005 bops
--------------------------
SUM: throughput = 628491.19 SPECjbb2005 bops
That too is roughly 22% faster than mainline - the !THP regression
that was reported by Mel Gorman appears to be fixed.
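( The same cross-check for the !THP runs:

     628491.19 / 512540.33 = 1.226  ->  ~22.6% over mainline )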
AutoNUMA-benchmark comparison to the mainline kernel:
##############
# res-v3.6-vanilla.log vs res-numacore-v18b.log:
#------------------------------------------------------------------------------------>
autonuma benchmark run time (lower is better) speedup %
------------------------------------------------------------------------------------->
numa01 : 337.29 vs. 177.64 | +89.8 %
numa01_THREAD_ALLOC : 428.79 vs. 127.07 | +237.4 %
numa02 : 56.32 vs. 18.08 | +211.5 %
------------------------------------------------------------
(this is similar to -v17, within noise.)
Comparison to AutoNUMA-v28 (+THP-fix):
##############
# res-autonuma-v28-THP.log vs res-numacore-v18b.log:
#------------------------------------------------------------------------------------>
autonuma benchmark run time (lower is better) speedup %
------------------------------------------------------------------------------------->
numa01 : 235.77 vs. 177.64 | +32.7 %
numa01_THREAD_ALLOC : 134.53 vs. 127.07 | +5.8 %
numa02 : 19.49 vs. 18.08 | +7.7 %
------------------------------------------------------------
A few caveats: I'm still seeing problems on !THP.
Here's the analysis of one of the last remaining regression sources
I'm still seeing with !THP on larger systems. I have identified the
cause, and I see how the AutoNUMA and 'balancenuma' trees solved
this problem - but I disagree with their solution.
When pushed hard enough via threaded workloads (for example via the
numa02 test), the upstream page migration code in mm/migrate.c
becomes unscalable, resulting in a lot of scheduling on the anon_vma
mutex and a subsequent drop in performance.
When the points of scheduling are call-graph profiled, the
unscalability appears to be due to the interaction between the
following page migration code paths:
96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
|
--- perf_trace_sched_switch
__schedule
schedule
schedule_preempt_disabled
__mutex_lock_common.isra.6
__mutex_lock_slowpath
mutex_lock
|
|--50.61%-- rmap_walk
| move_to_new_page
| migrate_pages
| migrate_misplaced_page
| __do_numa_page.isra.69
| handle_pte_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| __memset_sse2
| |
| --100.00%-- worker_thread
| |
| --100.00%-- start_thread
|
--49.39%-- page_lock_anon_vma
try_to_unmap_anon
try_to_unmap
migrate_pages
migrate_misplaced_page
__do_numa_page.isra.69
handle_pte_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
__memset_sse2
|
--100.00%-- worker_thread
start_thread
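The pattern is easy to model in isolation. Below is a minimal
userspace analogy (plain pthreads - not the kernel code): every
thread funnels its 'migration' through a single shared mutex, the
way concurrent migrations of pages that share an anon_vma all
serialize on that anon_vma's mutex in the profile above:

  /*
   * Userspace model of the contention: NTHREADS workers all
   * serialize on one mutex, so adding threads adds scheduling
   * on the lock instead of adding throughput.
   *
   * Build with: gcc -O2 -pthread contention.c -o contention
   */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NTHREADS        32
  #define ITERS           200000

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static long shared_work;

  static void *worker(void *arg)
  {
          int i;

          (void)arg;
          for (i = 0; i < ITERS; i++) {
                  /* Stand-in for page_lock_anon_vma()/rmap_walk():
                     every 'migration' takes the same lock. */
                  pthread_mutex_lock(&lock);
                  shared_work++;
                  pthread_mutex_unlock(&lock);
          }
          return NULL;
  }

  int main(void)
  {
          pthread_t tid[NTHREADS];
          int i;

          for (i = 0; i < NTHREADS; i++)
                  if (pthread_create(&tid[i], NULL, worker, NULL))
                          exit(1);
          for (i = 0; i < NTHREADS; i++)
                  pthread_join(tid[i], NULL);

          printf("done: %ld lock acquisitions\n", shared_work);
          return 0;
  }

Call-graph profiling such a model shows the same kind of mutex
slowpath hotspot, and it gets worse as the thread count grows.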
From what I can see, the AutoNUMA and 'balancenuma' kernels work
around this !THP scalability issue by rate-limiting migrations.
For example, balancenuma rate-limits migrations to about 1.2 GB/sec
of bandwidth.
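Schematically, that kind of rate limiting looks like this (a sketch
of the general window-plus-budget approach with assumed constants,
not balancenuma's actual code):

  /*
   * Window-based migration rate limiting: allow at most a fixed
   * byte budget per time window. 120 MB per 100 ms window
   * corresponds to the ~1.2 GB/sec figure above.
   */
  #include <stdbool.h>
  #include <stdio.h>

  #define WINDOW_MS       100
  #define WINDOW_BUDGET   (120UL << 20)   /* bytes per window */

  static unsigned long window_start_ms;
  static unsigned long window_used_bytes;

  static bool migrate_allowed(unsigned long now_ms, unsigned long bytes)
  {
          if (now_ms - window_start_ms >= WINDOW_MS) {
                  window_start_ms = now_ms;       /* new window */
                  window_used_bytes = 0;          /* reset budget */
          }
          if (window_used_bytes + bytes > WINDOW_BUDGET)
                  return false;                   /* over budget: skip */
          window_used_bytes += bytes;
          return true;
  }

  int main(void)
  {
          unsigned long t, granted = 0;

          /* Simulate 1000 x 2MB migration requests, one per ms: */
          for (t = 0; t < 1000; t++)
                  granted += migrate_allowed(t, 2UL << 20);

          printf("granted %lu of 1000 requests\n", granted);
          return 0;
  }

Note that such a limiter rejects requests purely on bandwidth, with
no idea whether the rejected migrations would have been beneficial -
which is exactly the objection below.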
Rate-limiting to solve scalability limits is not the right
solution IMO, because it hurts cases where migration is justified.
The migration of the working set itself is not a problem, it would
in fact be beneficial - but our implementation of it does not scale
beyond a certain rate.
( THP, which has a 512 times lower natural rate of migration page
  faults - one hinting fault per 2MB hugepage instead of per 4K
  page, and 2MB/4K = 512 - does not run into this scalability
  limit. )
So this issue is still open and testers are encouraged to use THP
if they can.
These patches are on top of the "v17" tree (no point in resending
those), and they can all be found in the tip:master tree as well:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
Please re-report any bugs and regressions that you can still see.
Reports, fixes, suggestions are welcome, as always!
Thanks,
Ingo
--------------------->
Ingo Molnar (10):
sched: Add "task flipping" support
sched: Move the NUMA placement logic to a worklet
numa, mempolicy: Improve CONFIG_NUMA_BALANCING=y OOM behavior
mm, numa: Turn 4K pte NUMA faults into effective hugepage ones
sched: Introduce directed NUMA convergence
sched: Remove statistical NUMA scheduling
sched: Track quality and strength of convergence
sched: Converge NUMA migrations
sched: Add convergence strength based adaptive NUMA page fault rate
sched: Refine the 'shared tasks' memory interleaving logic
include/linux/migrate.h | 6 +
include/linux/sched.h | 12 +-
include/uapi/linux/mempolicy.h | 1 +
init/Kconfig | 1 +
kernel/sched/core.c | 99 ++-
kernel/sched/fair.c | 1913 ++++++++++++++++++++++++++++------------
kernel/sched/features.h | 24 +-
kernel/sched/sched.h | 19 +-
kernel/sysctl.c | 11 +-
mm/huge_memory.c | 50 +-
mm/memory.c | 151 +++-
mm/mempolicy.c | 86 +-
mm/migrate.c | 3 +-
mm/mprotect.c | 24 +-
14 files changed, 1699 insertions(+), 701 deletions(-)
--
1.7.11.7