Message-Id: <1432753468-7785-3-git-send-email-riel@redhat.com>
Date:	Wed, 27 May 2015 15:04:28 -0400
From:	riel@...hat.com
To:	linux-kernel@...r.kernel.org
Cc:	mgorman@...e.de, jhladky@...hat.com, peterz@...radead.org,
	mingo@...nel.org, dedekind1@...il.com
Subject: [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination

From: Rik van Riel <riel@...hat.com>

Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
preferred node") fixes an issue where workloads would never converge
on a fully loaded (or overloaded) system.

However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of properly staying
spread out across the whole system. This leads to a reduction in available
memory bandwidth and usable CPU cache, with predictable performance problems.

The root cause appears to be an interaction between the load balancer and
NUMA balancing, where the short term load represented by the load balancer
differs from the long term load the NUMA balancing code would like to base
its decisions on.

Simply reverting a43455a1 would re-introduce the non-convergence of
workloads on fully loaded systems, so that is not a good option. As
an aside, the check done before a43455a1 only applied to a task's
preferred node, not to other candidate nodes in the system, so even
with that check in place the converge-on-too-few-nodes problem still
happens, just to a lesser degree.

Instead, try to compensate for the impedance mismatch between the
load balancer and NUMA balancing by only ever considering a less
loaded node as a destination for NUMA balancing, regardless of
whether the task is trying to move to its preferred node or to
another node.
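
To make the comparison concrete, here is a standalone user-space
sketch (the helper and argument names are made up; in the patch
below, numa_has_capacity() performs this check on the source and
destination numa_stats):

  #include <stdbool.h>

  /*
   * Treat dst as a valid NUMA balancing destination only if it is less
   * loaded than src, with each node's load scaled by its compute
   * capacity. Cross-multiplying avoids a division:
   *
   *   src_load / src_capacity  >  dst_load / dst_capacity
   */
  static bool dst_less_loaded(unsigned long src_load, unsigned long src_capacity,
                              unsigned long dst_load, unsigned long dst_capacity)
  {
          return src_load * dst_capacity > dst_load * src_capacity;
  }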

This patch also addresses an issue introduced by 095bebf61a46
("sched/numa: Do not move past the balance point if unbalanced"),
where a system with a single runnable thread would never migrate
that thread to the node holding its memory.

In a test where the main thread creates a large memory area and
spawns a worker thread (placed on another node by
select_task_rq_fair) to iterate over that memory, then goes to sleep
and waits for the worker to loop over all the memory, the worker
thread is now migrated to the node where the memory is, instead of
all the memory being migrated over to the worker as before.
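
For reference, the test is roughly the following (a simplified
user-space sketch, not the exact reproducer; the working set size
and loop count are made up):

  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>

  #define MEM_SIZE (4UL << 30)    /* hypothetical working set */

  static char *mem;

  /* Worker: repeatedly touch all of the memory set up by the main thread. */
  static void *worker(void *arg)
  {
          size_t i;
          int pass;

          for (pass = 0; pass < 100; pass++)
                  for (i = 0; i < MEM_SIZE; i += 4096)
                          mem[i]++;
          return NULL;
  }

  int main(void)
  {
          pthread_t thread;

          mem = malloc(MEM_SIZE);
          memset(mem, 0, MEM_SIZE);  /* fault the pages in near the main thread */

          /* select_task_rq_fair may place the worker on another node */
          pthread_create(&thread, NULL, worker, NULL);

          /* the main thread sleeps here until the worker has looped over the memory */
          pthread_join(&thread, NULL);

          free(mem);
          return 0;
  }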

Jirka has run a number of performance tests on several systems:
single instance SpecJBB 2005 performance is 7-15% higher on a 4 node
system, with higher gains on systems with more cores per socket.
Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
little or no change with the revert of 095bebf61a46 and this patch.

Signed-off-by: Rik van Riel <riel@...hat.com>
Reported-by: Artem Bityutskiy <dedekind1@...il.com>
Reported-by: Jirka Hladky <jhladky@...hat.com>
Tested-by: Jirka Hladky <jhladky@...hat.com>
---
 kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c47bf0dffb34..f655f2ad155d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	}
 }
 
+/* Only move tasks to a NUMA node less busy than the current node. */
+static bool numa_has_capacity(struct task_numa_env *env)
+{
+	struct numa_stats *src = &env->src_stats;
+	struct numa_stats *dst = &env->dst_stats;
+
+	if (src->has_free_capacity && !dst->has_free_capacity)
+		return false;
+
+	/*
+	 * Only consider a task move if the source node has a higher load
+	 * than the destination node, corrected for CPU capacity on each node.
+	 *
+	 *      src->load                dst->load
+	 * --------------------- vs ---------------------
+	 * src->compute_capacity    dst->compute_capacity
+	 */
+	if (src->load * dst->compute_capacity >
+	    dst->load * src->compute_capacity)
+		return true;
+
+	return false;
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
@@ -1452,7 +1476,8 @@ static int task_numa_migrate(struct task_struct *p)
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/* Try to find a spot on the preferred nid. */
-	task_numa_find_cpu(&env, taskimp, groupimp);
+	if (numa_has_capacity(&env))
+		task_numa_find_cpu(&env, taskimp, groupimp);
 
 	/*
 	 * Look at other nodes in these cases:
@@ -1483,7 +1508,8 @@ static int task_numa_migrate(struct task_struct *p)
 			env.dist = dist;
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			task_numa_find_cpu(&env, taskimp, groupimp);
+			if (numa_has_capacity(&env))
+				task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
 
-- 
2.1.0

