linux-kernel - [PATCH 39/52] sched: Track shared task's node groups and interleave their memory allocations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Sun,  2 Dec 2012 19:43:31 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc:	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Paul Turner <pjt@...gle.com>,
	Lee Schermerhorn <Lee.Schermerhorn@...com>,
	Christoph Lameter <cl@...ux.com>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Johannes Weiner <hannes@...xchg.org>,
	Hugh Dickins <hughd@...gle.com>
Subject: [PATCH 39/52] sched: Track shared task's node groups and interleave their memory allocations

This patch shows the power of the shared/private distinction: in
the shared tasks active balancing function (sched_update_ideal_cpu_shared())
we are able to to build a per (shared) task node mask of the nodes that
it and its buddies occupy at the moment.

Private tasks on the other hand are not affected and continue to do
efficient node-local allocations.

There's two important special cases:

 - if a group of shared tasks fits on a single node. In this case
   the interleaving happens on a single bit, a single node and thus
   turns into nice node-local allocations.

 - if a large group spans the whole system: in this case the node
   masks will cover the whole system, and all memory gets evenly
   interleaved and available RAM bandwidth gets utilized. This is
   preferable to allocating memory assymetrically and overloading
   certain CPU links and running into their bandwidth limitations.

This patch, in combination with the private/shared buddies patch,
optimizes the "4x JVM", "single JVM" and "2x JVM" SPECjbb workloads
on a 4-node system produce almost completely perfect memory placement.

For example a 4-JVM workload on a 4-node, 32-CPU system has
this performance (8 SPECjbb warehouses per JVM):

 spec1.txt:           throughput =     177460.44 SPECjbb2005 bops
 spec2.txt:           throughput =     176175.08 SPECjbb2005 bops
 spec3.txt:           throughput =     175053.91 SPECjbb2005 bops
 spec4.txt:           throughput =     171383.52 SPECjbb2005 bops

Which is close to the hard binding performance figures.

while previously it would regress compared to mainline.

Mainline has the following 4x JVM performance:

 spec1.txt:           throughput =     157839.25 SPECjbb2005 bops
 spec2.txt:           throughput =     156969.15 SPECjbb2005 bops
 spec3.txt:           throughput =     157571.59 SPECjbb2005 bops
 spec4.txt:           throughput =     157873.86 SPECjbb2005 bops

So the patch brings a 12% speedup.

This placement idea came while discussing interleaving strategies
with Christoph Lameter.

Suggested-by: Christoph Lameter <cl@...ux.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: Andrea Arcangeli <aarcange@...hat.com>
Cc: Rik van Riel <riel@...hat.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Hugh Dickins <hughd@...gle.com>
Signed-off-by: Ingo Molnar <mingo@...nel.org>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f3fb508..79f306c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -922,6 +922,10 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p)
 			buddies++;
 		}
 		WARN_ON_ONCE(buddies > full_buddies);
+		if (buddies)
+			node_set(node, p->numa_policy.v.nodes);
+		else
+			node_clear(node, p->numa_policy.v.nodes);
 
 		/* Don't go to a node that is already at full capacity: */
 		if (buddies == full_buddies)
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/