Message-ID: <20150616155450.62ec234b@cuia.usersys.redhat.com>
Date:	Tue, 16 Jun 2015 15:54:50 -0400
From:	Rik van Riel <riel@...hat.com>
To:	linux-kernel@...r.kernel.org
Cc:	peterz@...radead.org, srikar@...ux.vnet.ibm.com, mingo@...nel.org,
	mgorman@...e.de
Subject: [PATCH] sched,numa: document and fix numa_preferred_nid setting

There are two places where the NUMA balancing code sets a task's
numa_preferred_nid.

The primary location is task_numa_placement(), where the kernel
examines the NUMA fault statistics to find the node that holds
most of the memory the task (or numa_group) accesses.
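
To illustrate the idea behind that primary location, here is a minimal
userspace sketch in C. It is not the kernel's task_numa_placement(): the
helper name pick_preferred_nid(), the NO_NODE constant and the flat
per-node fault array are assumptions made for this example, and the
statistics decay and numa_group handling done by the real code are
left out.

```c
#include <stddef.h>

#define NO_NODE (-1)	/* hypothetical "no preference" marker */

/*
 * Hypothetical helper, not kernel code: return the id of the node with
 * the most recorded NUMA hinting faults, or NO_NODE when no faults were
 * recorded at all. Ties go to the lowest node id.
 */
static int pick_preferred_nid(const unsigned long *faults, size_t nr_nodes)
{
	unsigned long max_faults = 0;
	int max_nid = NO_NODE;
	size_t nid;

	for (nid = 0; nid < nr_nodes; nid++) {
		/* Strictly greater, so an all-zero array yields NO_NODE. */
		if (faults[nid] > max_faults) {
			max_faults = faults[nid];
			max_nid = (int)nid;
		}
	}
	return max_nid;
}
```

With per-node fault counts of {10, 250, 40, 250} the helper picks node 1,
the first node with the highest count.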

The second location, in task_numa_migrate(), is only used for large
workloads, where a numa_group has enough tasks that the tasks are
spread out over several NUMA nodes, and multiple nodes are in the
numa_group's active_nodes mask.

In order to allow those workloads to settle down, we pretend
that any node inside the numa_group's active_nodes mask is the
task's new preferred node. This dissuades task_numa_fault()
from continually retrying to migrate the task to the group's
preferred node, which lets a multi-node workload converge and
in turn improves locality of private faults inside a numa
group.

The fix: task_numa_migrate() tested nid against active_nodes,
but then passed env.dst_nid to sched_setnuma(). Pass nid, the
node that was actually tested, instead.
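
The active_nodes rule can be sketched in plain C as follows. This is a
simplified userspace model, not the code in kernel/sched/fair.c: struct
migrate_env, node_is_active() and settle_preferred_nid() are names
invented for the example, the active_nodes mask is modelled as a single
unsigned long, and the fallback to src_nid when no better CPU was found
is an assumption, since that line sits between the two hunks of the diff
and is not shown there.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the migration state; field names mirror the patch. */
struct migrate_env {
	int src_nid;	/* node the task currently runs on */
	int dst_nid;	/* node the migration attempt targeted */
	int best_cpu;	/* -1 when no better CPU was found */
};

/* Model of node_isset(); nid must be >= 0 and below the width of unsigned long. */
static bool node_is_active(unsigned long active_nodes, int nid)
{
	return (active_nodes >> nid) & 1UL;
}

/*
 * Return the task's new preferred node, or cur_preferred when the node
 * the task ends up on is outside the group's active nodes.
 */
static int settle_preferred_nid(const struct migrate_env *env,
				unsigned long active_nodes, int cur_preferred)
{
	/* Assumed: the task stays on src_nid when no better CPU exists. */
	int nid = (env->best_cpu == -1) ? env->src_nid : env->dst_nid;

	if (node_is_active(active_nodes, nid))
		return nid;	/* the node that was tested, not env->dst_nid */

	return cur_preferred;	/* leave the preference alone, retry later */
}
```

With nodes 1 and 2 active (mask 0x6), a task that stays on node 2 because
no better CPU was found gets node 2 as its preferred node in this model,
even if dst_nid pointed elsewhere. A task that lands on an inactive node
keeps its existing preference.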

Reported-by: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Signed-off-by: Rik van Riel <riel@...hat.com>
---
 kernel/sched/fair.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2980e8733bc..54bb57f09e75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1485,7 +1485,12 @@ static int task_numa_migrate(struct task_struct *p)
 				groupweight = group_weight(p, env.src_nid, dist);
 			}
 
-			/* Only consider nodes where both task and groups benefit */
+			/*
+			 * Only consider nodes where placement is better for
+			 * either the group (help large workloads converge),
+			 * or the task (placement of tasks within a numa group,
+			 * and single threaded processes).
+			 */
 			taskimp = task_weight(p, nid, dist) - taskweight;
 			groupimp = group_weight(p, nid, dist) - groupweight;
 			if (taskimp < 0 && groupimp < 0)
@@ -1499,12 +1504,14 @@ static int task_numa_migrate(struct task_struct *p)
 	}
 
 	/*
-	 * If the task is part of a workload that spans multiple NUMA nodes,
-	 * and is migrating into one of the workload's active nodes, remember
-	 * this node as the task's preferred numa node, so the workload can
-	 * settle down.
-	 * A task that migrated to a second choice node will be better off
-	 * trying for a better one later. Do not set the preferred node here.
+	 * The primary place for setting a task's numa_preferred_nid is in
+	 * task_numa_placement(). If a task is moved to a sub-optimal node,
+	 * leave numa_preferred_nid alone, so task_numa_fault() will retry
+	 * migrating the task to where it really belongs.
+	 * The exception is a task that belongs to a large numa_group, which
+	 * spans multiple NUMA nodes. If that task migrates into one of the
+	 * workload's active nodes, remember that node as the task's
+	 * numa_preferred_nid, so the workload can settle down.
 	 */
 	if (p->numa_group) {
 		if (env.best_cpu == -1)
@@ -1513,7 +1520,7 @@ static int task_numa_migrate(struct task_struct *p)
 			nid = env.dst_nid;
 
 		if (node_isset(nid, p->numa_group->active_nodes))
-			sched_setnuma(p, env.dst_nid);
+			sched_setnuma(p, nid);
 	}
 
 	/* No better CPU than the current one was found. */
